3. Data visualisation

3.2.4 Exercises

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

Nothing is plotted. ggplot() on its own only initialises the plot; points appear only once a layer (e.g., a geom) is added.

  2. How many rows are in mpg? How many columns?
mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
##  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
##  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
##  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
##  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
##  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
##  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
##  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
##  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
## 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>

234 rows and 11 columns in total.
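The dimensions can also be read off directly, without printing the whole tibble:

```r
library(ggplot2)  # mpg ships with ggplot2

nrow(mpg)  # 234
ncol(mpg)  # 11
dim(mpg)   # 234 11
```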

  3. What does the drv variable describe? Read the help for ?mpg to find out.

drv describes whether the car is front-wheel drive (f), rear-wheel drive (r), or four-wheel drive (4).
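One quick way to confirm the levels (a sketch using dplyr's count(); both packages are part of the tidyverse):

```r
library(ggplot2)  # for the mpg data
library(dplyr)

# 4 = four-wheel drive, f = front-wheel drive, r = rear-wheel drive
count(mpg, drv)
```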

  4. Make a scatterplot of hwy vs cyl.
ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point()

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(mpg, aes(x = class, y = drv)) +
  geom_point()

Scatterplots work best for two continuous variables (e.g., cty and hwy). class and drv are both categorical, so the points can only land on a small grid of combinations: every observation with the same (class, drv) pair is plotted on top of the others, and the plot gives no sense of how many observations each point represents.

3.3.1 Exercises

  1. What’s gone wrong with this code? Why are the points not blue?

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

Inside aes(), "blue" is not interpreted as a colour: it is treated as a data value, so ggplot2 creates a one-level categorical variable and maps it to its default colour scale (which is why the points come out reddish). To make the points blue, colour must be set manually, i.e., outside aes():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
##  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
##  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
##  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
##  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
##  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
##  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
##  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
##  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
## 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>

Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cty, hwy

This information appears below each variable name in the printed output: <chr> indicates a character vector (categorical), while <int> and <dbl> indicate integer and double vectors (continuous).

  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
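No answer was recorded here; a minimal sketch of what happens when a continuous variable (cty) is mapped to each aesthetic:

```r
library(ggplot2)

# colour: continuous variables get a smooth colour gradient, not discrete hues
ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point()

# size: the point area scales with the value
ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()

# shape: an error, because a continuous variable cannot be mapped to shape
# ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
#   geom_point()
```

For categorical variables, these aesthetics instead assign one discrete colour, size, or shape per level.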

  4. What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ))

The variable is simply encoded twice. Here displ drives both the x position and the colour, so the colour forms a gradient that tracks the x axis and adds no new information.

  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

stroke controls the width of a shape's border. It is most useful with the filled shapes (21-24), since not every shape has a separate border.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), stroke = 2, shape = 23)

  6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))

In this example, each observation with displ < 5 is grouped together (TRUE). All remaining observations form a second group (FALSE).

4. Workflow: basics

4.4 Practice

  1. Why does this code not work?

    my_variable <- 10
    my_varıable
    ## Error in eval(expr, envir, enclos): object 'my_varıable' not found

    Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)

There is a typo on the second line: the character that looks like an i in my_varıable is actually a dotless ı, so the name does not match. It should be my_variable, not my_varıable.

  2. Tweak each of the following R commands so that they run correctly:

    library(tidyverse)
    
    ggplot(dota = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy))
    
    fliter(mpg, cyl = 8)
    filter(diamond, carat > 3)
library(tidyverse)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)
## # A tibble: 70 x 11
##    manufacturer model     displ  year   cyl trans  drv     cty   hwy fl   
##    <chr>        <chr>     <dbl> <int> <int> <chr>  <chr> <int> <int> <chr>
##  1 audi         a6 quatt…  4.20  2008     8 auto(… 4        16    23 p    
##  2 chevrolet    c1500 su…  5.30  2008     8 auto(… r        14    20 r    
##  3 chevrolet    c1500 su…  5.30  2008     8 auto(… r        11    15 e    
##  4 chevrolet    c1500 su…  5.30  2008     8 auto(… r        14    20 r    
##  5 chevrolet    c1500 su…  5.70  1999     8 auto(… r        13    17 r    
##  6 chevrolet    c1500 su…  6.00  2008     8 auto(… r        12    17 r    
##  7 chevrolet    corvette   5.70  1999     8 manua… r        16    26 p    
##  8 chevrolet    corvette   5.70  1999     8 auto(… r        15    23 p    
##  9 chevrolet    corvette   6.20  2008     8 manua… r        16    26 p    
## 10 chevrolet    corvette   6.20  2008     8 auto(… r        15    25 p    
## # ... with 60 more rows, and 1 more variable: class <chr>
filter(diamonds, carat > 3)
## # A tibble: 32 x 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  3.01 Premium I     I1       62.7   58.  8040  9.10  8.97  5.67
##  2  3.11 Fair    J     I1       65.9   57.  9823  9.15  9.02  5.98
##  3  3.01 Premium F     I1       62.2   56.  9925  9.24  9.13  5.73
##  4  3.05 Premium E     I1       60.9   58. 10453  9.26  9.25  5.66
##  5  3.02 Fair    I     I1       65.2   56. 10577  9.11  9.02  5.91
##  6  3.01 Fair    H     I1       56.1   62. 10761  9.54  9.38  5.31
##  7  3.65 Fair    H     I1       67.1   53. 11668  9.53  9.48  6.38
##  8  3.24 Premium H     I1       62.1   58. 12300  9.44  9.40  5.85
##  9  3.22 Ideal   I     I1       62.6   55. 12545  9.49  9.42  5.92
## 10  3.50 Ideal   H     I1       62.8   57. 12587  9.65  9.59  6.03
## # ... with 22 more rows

  3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

The Keyboard Shortcut Quick Reference overlay appears. The same overlay can be reached from the menus via Tools > Keyboard Shortcuts Help.

5. Data transformation

5.2.4 Exercises

  1. Find all flights that

    1. Had an arrival delay of two or more hours
    2. Flew to Houston (IAH or HOU)
    3. Were operated by United, American, or Delta
    4. Departed in summer (July, August, and September)
    5. Arrived more than two hours late, but didn’t leave late
    6. Were delayed by at least an hour, but made up over 30 minutes in flight
    7. Departed between midnight and 6am (inclusive)

Had an arrival delay of two or more hours:

filter(flights, arr_delay >= 120)
## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      811            630      101.     1047
##  2  2013     1     1      848           1835      853.     1001
##  3  2013     1     1      957            733      144.     1056
##  4  2013     1     1     1114            900      134.     1447
##  5  2013     1     1     1505           1310      115.     1638
##  6  2013     1     1     1525           1340      105.     1831
##  7  2013     1     1     1549           1445       64.     1912
##  8  2013     1     1     1558           1359      119.     1718
##  9  2013     1     1     1732           1630       62.     2028
## 10  2013     1     1     1803           1620      103.     2008
## # ... with 10,190 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Flew to Houston (IAH or HOU):

filter(flights, dest %in% c("IAH", "HOU"))
## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      623            627       -4.      933
##  4  2013     1     1      728            732       -4.     1041
##  5  2013     1     1      739            739        0.     1104
##  6  2013     1     1      908            908        0.     1228
##  7  2013     1     1     1028           1026        2.     1350
##  8  2013     1     1     1044           1045       -1.     1352
##  9  2013     1     1     1114            900      134.     1447
## 10  2013     1     1     1205           1200        5.     1503
## # ... with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Were operated by United, American, or Delta:

filter(flights, carrier %in% c("UA", "AA", "DL"))
## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      554            600       -6.      812
##  5  2013     1     1      554            558       -4.      740
##  6  2013     1     1      558            600       -2.      753
##  7  2013     1     1      558            600       -2.      924
##  8  2013     1     1      558            600       -2.      923
##  9  2013     1     1      559            600       -1.      941
## 10  2013     1     1      559            600       -1.      854
## # ... with 139,494 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

(To find the two-letter carrier abbreviations, see the airlines dataset.)

Departed in summer (July, August, and September):

filter(flights, month %in% c(7, 8, 9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7     1        1           2029      212.      236
##  2  2013     7     1        2           2359        3.      344
##  3  2013     7     1       29           2245      104.      151
##  4  2013     7     1       43           2130      193.      322
##  5  2013     7     1       44           2150      174.      300
##  6  2013     7     1       46           2051      235.      304
##  7  2013     7     1       48           2001      287.      308
##  8  2013     7     1       58           2155      183.      335
##  9  2013     7     1      100           2146      194.      327
## 10  2013     7     1      100           2245      135.      337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Arrived more than two hours late, but didn’t leave late:

filter(flights, arr_delay >= 120 & dep_delay <= 0)
## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    27     1419           1420       -1.     1754
##  2  2013    10     7     1350           1350        0.     1736
##  3  2013    10     7     1357           1359       -2.     1858
##  4  2013    10    16      657            700       -3.     1258
##  5  2013    11     1      658            700       -2.     1329
##  6  2013     3    18     1844           1847       -3.       39
##  7  2013     4    17     1635           1640       -5.     2049
##  8  2013     4    18      558            600       -2.     1149
##  9  2013     4    18      655            700       -5.     1213
## 10  2013     5    22     1827           1830       -3.     2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Were delayed by at least an hour, but made up over 30 minutes in flight:

filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
## # A tibble: 1,844 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1     2205           1720      285.       46
##  2  2013     1     1     2326           2130      116.      131
##  3  2013     1     3     1503           1221      162.     1803
##  4  2013     1     3     1839           1700       99.     2056
##  5  2013     1     3     1850           1745       65.     2148
##  6  2013     1     3     1941           1759      102.     2246
##  7  2013     1     3     1950           1845       65.     2228
##  8  2013     1     3     2015           1915       60.     2135
##  9  2013     1     3     2257           2000      177.       45
## 10  2013     1     4     1917           1700      137.     2135
## # ... with 1,834 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Departed between midnight and 6am (inclusive):

filter(flights, dep_time <= 600 | dep_time == 2400)
## # A tibble: 9,373 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 9,363 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

  2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

It is a shortcut for finding values that fall within a specified range:

filter(flights, between(month, 7, 9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7     1        1           2029      212.      236
##  2  2013     7     1        2           2359        3.      344
##  3  2013     7     1       29           2245      104.      151
##  4  2013     7     1       43           2130      193.      322
##  5  2013     7     1       44           2150      174.      300
##  6  2013     7     1       46           2051      235.      304
##  7  2013     7     1       48           2001      287.      308
##  8  2013     7     1       58           2155      183.      335
##  9  2013     7     1      100           2146      194.      327
## 10  2013     7     1      100           2245      135.      337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

  3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1       NA           1630        NA       NA
##  2  2013     1     1       NA           1935        NA       NA
##  3  2013     1     1       NA           1500        NA       NA
##  4  2013     1     1       NA            600        NA       NA
##  5  2013     1     2       NA           1540        NA       NA
##  6  2013     1     2       NA           1620        NA       NA
##  7  2013     1     2       NA           1355        NA       NA
##  8  2013     1     2       NA           1420        NA       NA
##  9  2013     1     2       NA           1321        NA       NA
## 10  2013     1     2       NA           1545        NA       NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

8,255 flights.

To see what other variables are missing, you can use View():

View(filter(flights, is.na(dep_time)))

dep_delay, arr_time, arr_delay, and air_time are also missing. These flights were most likely cancelled.
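View() requires inspecting by eye; a programmatic alternative (assuming dplyr 1.0 or later for across()) counts the missing values in every column at once:

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(is.na(dep_time)) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```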

  4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

The general rule: if the result would be the same no matter which value replaced the NA, R returns that result; otherwise it returns NA. NA ^ 0 is 1 because x ^ 0 is 1 for every x. NA | TRUE is TRUE because anything OR TRUE is TRUE. FALSE & NA is FALSE because FALSE AND anything is FALSE. NA * 0 remains NA because the missing value could be Inf or -Inf, and Inf * 0 is NaN rather than 0. (See also Jeffrey Arnold's explanation.)
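These can be checked directly at the console:

```r
NA ^ 0      # 1
NA | TRUE   # TRUE
FALSE & NA  # FALSE
NA * 0      # NA
```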

5.3.1 Exercises

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange(flights, desc(is.na(dep_time)))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1       NA           1630        NA       NA
##  2  2013     1     1       NA           1935        NA       NA
##  3  2013     1     1       NA           1500        NA       NA
##  4  2013     1     1       NA            600        NA       NA
##  5  2013     1     2       NA           1540        NA       NA
##  6  2013     1     2       NA           1620        NA       NA
##  7  2013     1     2       NA           1355        NA       NA
##  8  2013     1     2       NA           1420        NA       NA
##  9  2013     1     2       NA           1321        NA       NA
## 10  2013     1     2       NA           1545        NA       NA
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

  2. Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900     1301.     1242
##  2  2013     6    15     1432           1935     1137.     1607
##  3  2013     1    10     1121           1635     1126.     1239
##  4  2013     9    20     1139           1845     1014.     1457
##  5  2013     7    22      845           1600     1005.     1044
##  6  2013     4    10     1100           1900      960.     1342
##  7  2013     3    17     2321            810      911.      135
##  8  2013     6    27      959           1900      899.     1236
##  9  2013     7    22     2257            759      898.      121
## 10  2013    12     5      756           1700      896.     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12     7     2040           2123      -43.       40
##  2  2013     2     3     2022           2055      -33.     2240
##  3  2013    11    10     1408           1440      -32.     1549
##  4  2013     1    11     1900           1930      -30.     2233
##  5  2013     1    29     1703           1730      -27.     1947
##  6  2013     8     9      729            755      -26.     1002
##  7  2013    10    23     1907           1932      -25.     2143
##  8  2013     3    30     2030           2055      -25.     2213
##  9  2013     3     2     1431           1455      -24.     1601
## 10  2013     5     5      934            958      -24.     1225
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

  3. Sort flights to find the fastest flights.

This depends on how you define "fastest flights". It could refer to the flights with the shortest air_time:

arrange(flights, air_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    16     1355           1315       40.     1442
##  2  2013     4    13      537            527       10.      622
##  3  2013    12     6      922            851       31.     1021
##  4  2013     2     3     2153           2129       24.     2247
##  5  2013     2     5     1303           1315      -12.     1342
##  6  2013     2    12     2123           2130       -7.     2211
##  7  2013     3     2     1450           1500      -10.     1547
##  8  2013     3     8     2026           1935       51.     2131
##  9  2013     3    18     1456           1329       87.     1533
## 10  2013     3    19     2226           2145       41.     2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
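Alternatively, "fastest" could mean the highest average speed rather than the shortest flight; a sketch:

```r
library(dplyr)
library(nycflights13)

# average ground speed in mph (distance is in miles, air_time in minutes)
arrange(flights, desc(distance / air_time))
```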

  4. Which flights travelled the longest? Which travelled the shortest?

If we assume this refers to distance travelled:

arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      857            900       -3.     1516
##  2  2013     1     2      909            900        9.     1525
##  3  2013     1     3      914            900       14.     1504
##  4  2013     1     4      900            900        0.     1516
##  5  2013     1     5      858            900       -2.     1519
##  6  2013     1     6     1019            900       79.     1558
##  7  2013     1     7     1042            900      102.     1620
##  8  2013     1     8      901            900        1.     1504
##  9  2013     1     9      641            900     1301.     1242
## 10  2013     1    10      859            900       -1.     1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The longest flight is from JFK to HNL (Honolulu) at 4983 miles.

arrange(flights, distance)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7    27       NA            106       NA        NA
##  2  2013     1     3     2127           2129       -2.     2222
##  3  2013     1     4     1240           1200       40.     1333
##  4  2013     1     4     1829           1615      134.     1937
##  5  2013     1     4     2128           2129       -1.     2218
##  6  2013     1     5     1155           1200       -5.     1241
##  7  2013     1     6     2125           2129       -4.     2224
##  8  2013     1     7     2124           2129       -5.     2212
##  9  2013     1     8     2127           2130       -3.     2304
## 10  2013     1     9     2126           2129       -3.     2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The shortest flight is from EWR (Newark) to LGA (La Guardia) at 17 miles; however, this appears to be a cancelled flight (several variables, such as dep_time, are missing). The next shortest flight is from EWR to PHL (Philadelphia) at 80 miles.

(See airports to find the names that correspond to the FAA airport code.)
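For example, a quick lookup of the codes mentioned above:

```r
library(dplyr)
library(nycflights13)

filter(airports, faa %in% c("JFK", "HNL", "EWR", "LGA", "PHL"))
```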

5.4.1 Exercises

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517        2.      830       11.
##  2      533        4.      850       20.
##  3      542        2.      923       33.
##  4      544       -1.     1004      -18.
##  5      554       -6.      812      -25.
##  6      554       -4.      740       12.
##  7      555       -5.      913       19.
##  8      557       -3.      709      -14.
##  9      557       -3.      838       -8.
## 10      558       -2.      753        8.
## # ... with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517        2.      830       11.
##  2      533        4.      850       20.
##  3      542        2.      923       33.
##  4      544       -1.     1004      -18.
##  5      554       -6.      812      -25.
##  6      554       -4.      740       12.
##  7      555       -5.      913       19.
##  8      557       -3.      709      -14.
##  9      557       -3.      838       -8.
## 10      558       -2.      753        8.
## # ... with 336,766 more rows

  2. What happens if you include the name of a variable multiple times in a select() call?

The duplicates are ignored; select() returns the variable only once.

select(flights, dep_time, dep_time)
## # A tibble: 336,776 x 1
##    dep_time
##       <int>
##  1      517
##  2      533
##  3      542
##  4      544
##  5      554
##  6      554
##  7      555
##  8      557
##  9      557
## 10      558
## # ... with 336,766 more rows

  3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

    vars <- c("year", "month", "day", "dep_delay", "arr_delay")

The one_of() function allows you to use variables stored in a character vector:

select(flights, one_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1        2.       11.
##  2  2013     1     1        4.       20.
##  3  2013     1     1        2.       33.
##  4  2013     1     1       -1.      -18.
##  5  2013     1     1       -6.      -25.
##  6  2013     1     1       -4.       12.
##  7  2013     1     1       -5.       19.
##  8  2013     1     1       -3.      -14.
##  9  2013     1     1       -3.       -8.
## 10  2013     1     1       -2.        8.
## # ... with 336,766 more rows

It is useful when the variable names are already stored in a character vector, for example one generated programmatically, since the vector can be passed straight to select().

  4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    select(flights, contains("TIME"))

The select helpers are not case-sensitive by default, so contains("TIME") still matches dep_time, arr_time, and so on. To change the default, set ignore.case = FALSE inside the helper.
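A sketch of the case-sensitive version, which matches nothing here because no column name contains an upper-case TIME:

```r
library(dplyr)
library(nycflights13)

select(flights, contains("TIME", ignore.case = FALSE))
```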

5.5.2 Exercises

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
transmute(flights, 
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100,
  dep_time_mins = (hour * 60) + minute
)
## # A tibble: 336,776 x 4
##    dep_time  hour minute dep_time_mins
##       <int> <dbl>  <dbl>         <dbl>
##  1      517    5.    17.          317.
##  2      533    5.    33.          333.
##  3      542    5.    42.          342.
##  4      544    5.    44.          344.
##  5      554    5.    54.          354.
##  6      554    5.    54.          354.
##  7      555    5.    55.          355.
##  8      557    5.    57.          357.
##  9      557    5.    57.          357.
## 10      558    5.    58.          358.
## # ... with 336,766 more rows
transmute(flights, 
  sched_dep_time,
  hour = sched_dep_time %/% 100,
  minute = sched_dep_time %% 100,
  sched_dep_mins = (hour * 60) + minute
)
## # A tibble: 336,776 x 4
##    sched_dep_time  hour minute sched_dep_mins
##             <int> <dbl>  <dbl>          <dbl>
##  1            515    5.    15.           315.
##  2            529    5.    29.           329.
##  3            540    5.    40.           340.
##  4            545    5.    45.           345.
##  5            600    6.     0.           360.
##  6            558    5.    58.           358.
##  7            600    6.     0.           360.
##  8            600    6.     0.           360.
##  9            600    6.     0.           360.
## 10            600    6.     0.           360.
## # ... with 336,766 more rows
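One edge case worth noting: midnight is recorded as 2400, which the formula above maps to 1440 minutes rather than 0. Wrapping the result in %% 1440 fixes this (a sketch):

```r
library(dplyr)
library(nycflights13)

transmute(flights,
  dep_time,
  dep_time_mins = ((dep_time %/% 100) * 60 + dep_time %% 100) %% 1440
)
```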

  2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
transmute(flights, air_time, arr_time - dep_time)
## # A tibble: 336,776 x 2
##    air_time `arr_time - dep_time`
##       <dbl>                 <int>
##  1     227.                   313
##  2     227.                   317
##  3     160.                   381
##  4     183.                   460
##  5     116.                   258
##  6     150.                   186
##  7     158.                   358
##  8      53.                   152
##  9     140.                   281
## 10     138.                   195
## # ... with 336,766 more rows

You would expect arr_time - dep_time to equal air_time, but it does not. There are several reasons: dep_time and arr_time are stored in HHMM format, so subtracting them does not yield minutes (and overnight flights give negative values); both are recorded in the local time of their respective airports, so timezones may differ; and air_time excludes time spent taxiing. To fix it, convert both times to minutes since midnight before subtracting.
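To check this, convert both times to minutes since midnight and wrap the subtraction past midnight. A sketch, assuming dplyr and nycflights13 are loaded:

```r
library(dplyr)
library(nycflights13)

flights %>%
  mutate(
    dep_mins    = (dep_time %/% 100) * 60 + dep_time %% 100,
    arr_mins    = (arr_time %/% 100) * 60 + arr_time %% 100,
    # %% 1440 handles overnight flights, where arr_mins < dep_mins
    flight_time = (arr_mins - dep_mins) %% 1440
  ) %>%
  select(air_time, flight_time)
```

Even after this conversion, flight_time exceeds air_time because it includes taxiing, and the two endpoints are in different local timezones.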

  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
select(flights, dep_time, sched_dep_time, dep_delay)
## # A tibble: 336,776 x 3
##    dep_time sched_dep_time dep_delay
##       <int>          <int>     <dbl>
##  1      517            515        2.
##  2      533            529        4.
##  3      542            540        2.
##  4      544            545       -1.
##  5      554            600       -6.
##  6      554            558       -4.
##  7      555            600       -5.
##  8      557            600       -3.
##  9      557            600       -3.
## 10      558            600       -2.
## # ... with 336,766 more rows

You would expect dep_time - sched_dep_time == dep_delay, and it does hold, but only once both times are converted to minutes since midnight. Subtracting the raw HHMM values goes wrong whenever a delay crosses an hour boundary (e.g. 610 - 555 is 55, not the 15 minutes of actual delay).

  1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
flights_ranked <- mutate(flights, rank = min_rank(-dep_delay))
arrange(flights_ranked, desc(dep_delay))
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900     1301.     1242
##  2  2013     6    15     1432           1935     1137.     1607
##  3  2013     1    10     1121           1635     1126.     1239
##  4  2013     9    20     1139           1845     1014.     1457
##  5  2013     7    22      845           1600     1005.     1044
##  6  2013     4    10     1100           1900      960.     1342
##  7  2013     3    17     2321            810      911.      135
##  8  2013     6    27      959           1900      899.     1236
##  9  2013     7    22     2257            759      898.      121
## 10  2013    12     5      756           1700      896.     1058
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, rank <int>

min_rank() gives tied values the same (smallest possible) rank, equivalent to base rank() with ties.method = "min"; the rank after a tie is skipped. The 10 most delayed flights can therefore be found with filter(rank <= 10), which may return more than ten rows if there are ties.
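Putting that together, a direct way to pull the top ten while keeping any ties, assuming dplyr and nycflights13 are loaded:

```r
library(dplyr)
library(nycflights13)

flights %>%
  mutate(rank = min_rank(desc(dep_delay))) %>%
  filter(rank <= 10) %>%   # ties can make this return more than 10 rows
  arrange(rank)
```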

  1. What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11

R recycles the shorter vector to match the longer one: 1 + 1 = 2, 2 + 2 = 4, 3 + 3 = 6, then 1:3 starts over, giving 4 + 1 = 5, 5 + 2 = 7, and so on. Because 10 is not a multiple of 3, R also issues the warning shown above.
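When the longer length is a multiple of the shorter one, recycling happens silently:

```r
# 1:2 is recycled to 1 2 1 2 1 2 before the addition, with no warning
1:2 + 1:6
#> [1] 2 4 4 6 6 8
```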

  1. What trigonometric functions does R provide?

cos(), sin(), and tan(), their inverses acos(), asin(), and atan(), plus the two-argument atan2(). See ?Trig for the full list.
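All of them work in radians; for example:

```r
cos(pi)          # -1
sin(pi / 2)      #  1
acos(-1)         #  pi
atan2(1, 1)      #  pi / 4
# cospi(x), sinpi(x), and tanpi(x) compute cos(pi * x) etc. more accurately
cospi(1)         # -1
```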

5.6.7 Exercises

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

    • A flight is always 10 minutes late.

    • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

    • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

    Which is more important: arrival delay or departure delay?

flights %>% 
  group_by(flight, origin, dest) %>% 
  filter(arr_delay == 10) %>% 
  summarise()
## # A tibble: 1,877 x 3
## # Groups:   flight, origin [?]
##    flight origin dest 
##     <int> <chr>  <chr>
##  1      1 JFK    FLL  
##  2      3 EWR    DEN  
##  3      3 JFK    LAX  
##  4      3 JFK    SJU  
##  5      3 LGA    MDW  
##  6      4 JFK    BUF  
##  7      4 JFK    SJU  
##  8      5 EWR    FLL  
##  9      5 EWR    SEA  
## 10      6 JFK    BUF  
## # ... with 1,867 more rows
flights %>% 
  group_by(flight, origin, dest) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE), count = n()) %>% 
  filter(avg_arr_delay == 10)
## # A tibble: 65 x 5
## # Groups:   flight, origin [64]
##    flight origin dest  avg_arr_delay count
##     <int> <chr>  <chr>         <dbl> <int>
##  1      3 EWR    DEN             10.    23
##  2    102 LGA    MKE             10.     1
##  3    171 EWR    DEN             10.     5
##  4    252 EWR    RSW             10.     2
##  5    265 EWR    FLL             10.     2
##  6    271 EWR    TPA             10.     1
##  7    279 LGA    IAH             10.     1
##  8    299 EWR    PHX             10.     4
##  9    316 EWR    SFO             10.     1
## 10    363 EWR    DFW             10.     1
## # ... with 55 more rows

Departure delay is probably more important for travellers: time spent waiting at the departure gate is simply wasted, whereas part of an arrival delay can often be made up in the air.

  1. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(count = n())
## # A tibble: 104 x 2
##    dest  count
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows
not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(distance_total = sum(distance))
## # A tibble: 4,037 x 2
##    tailnum distance_total
##    <chr>            <dbl>
##  1 D942DN           3418.
##  2 N0EGMQ         239143.
##  3 N10156         109664.
##  4 N102UW          25722.
##  5 N103US          24619.
##  6 N104UW          24616.
##  7 N10575         139903.
##  8 N105UW          23618.
##  9 N107US          21677.
## 10 N108UW          32070.
## # ... with 4,027 more rows
  1. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

Flights that do not depart can never arrive. However, flights that depart but have a missing arr_delay may have been diverted (or, rarely, crashed), which is not the same thing as a cancellation.

The most important column depends on the question being asked, but dep_delay best captures cancellations for the reason above.

  1. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>% 
  group_by(month, day) %>% 
  summarise(cancelled = sum(is.na(dep_delay)))
## # A tibble: 365 x 3
## # Groups:   month [?]
##    month   day cancelled
##    <int> <int>     <int>
##  1     1     1         4
##  2     1     2         8
##  3     1     3        10
##  4     1     4         6
##  5     1     5         3
##  6     1     6         1
##  7     1     7         3
##  8     1     8         4
##  9     1     9         5
## 10     1    10         3
## # ... with 355 more rows

It is difficult to tell just from this output. Let’s visualise this instead:

flights %>% 
  mutate(cancelled = is.na(dep_delay)) %>% 
  group_by(month, day) %>% 
  summarise(
    prop_cancelled = mean(cancelled),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE)
  ) %>% 
  ggplot(aes(x = prop_cancelled, y = avg_dep_delay)) +
    geom_point() +
    geom_smooth(se = TRUE)
## `geom_smooth()` using method = 'loess'

There is a positive relationship between the proportion of cancelled flights and average delay. However, there are a few exceptional days with a high number of cancelled flights and relatively low average delay. If I had to guess, this might be related to bad weather (i.e., all flights were cancelled for a specific time period, so avg_dep_delay was close to zero).

  1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

Frontier Airlines (F9) has the worst delays:

not_cancelled %>% 
  group_by(carrier) %>% 
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay))
## # A tibble: 16 x 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9          20.2 
##  2 EV          19.8 
##  3 YV          18.9 
##  4 FL          18.6 
##  5 WN          17.7 
##  6 9E          16.4 
##  7 B6          13.0 
##  8 VX          12.8 
##  9 OO          12.6 
## 10 UA          12.0 
## 11 MQ          10.4 
## 12 DL           9.22
## 13 AA           8.57
## 14 AS           5.83
## 15 HA           4.90
## 16 US           3.74

You cannot disentangle the effects of bad airports vs. bad carriers because there is an uneven distribution in destinations by carrier. Some carriers might operate more frequently from bad airports and vice versa.

  1. What does the sort argument to count() do? When might you use it?

sort = TRUE sorts the output in descending order of n. Use it whenever you would otherwise pipe the result straight into arrange(desc(n)), for example to see the most common values first.

not_cancelled %>% count(dest, sort = TRUE)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ATL   16837
##  2 ORD   16566
##  3 LAX   16026
##  4 BOS   15022
##  5 MCO   13967
##  6 CLT   13674
##  7 SFO   13173
##  8 FLL   11897
##  9 MIA   11593
## 10 DCA    9111
## # ... with 94 more rows

5.7.1 Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

When combined with grouping, functions operate within groups only, not the entire dataset. For example:

# What is the mean air time for each carrier?
flights %>%
  group_by(carrier) %>% 
  summarise(mean_air_time = mean(air_time, na.rm = TRUE))
## # A tibble: 16 x 2
##    carrier mean_air_time
##    <chr>           <dbl>
##  1 9E               86.8
##  2 AA              189. 
##  3 AS              326. 
##  4 B6              151. 
##  5 DL              174. 
##  6 EV               90.1
##  7 F9              230. 
##  8 FL              101. 
##  9 HA              623. 
## 10 MQ               91.2
## 11 OO               83.5
## 12 UA              212. 
## 13 US               88.6
## 14 VX              337. 
## 15 WN              148. 
## 16 YV               65.7
# What is the mean arrival delay for each month excluding all flights that were not delayed?
flights %>% 
  filter(arr_delay > 0) %>% 
  group_by(month) %>%
  summarise(avg_arr_delay = mean(arr_delay))
## # A tibble: 12 x 2
##    month avg_arr_delay
##    <int>         <dbl>
##  1     1          34.5
##  2     2          33.7
##  3     3          40.6
##  4     4          42.7
##  5     5          41.9
##  6     6          53.7
##  7     7          54.0
##  8     8          39.5
##  9     9          38.8
## 10    10          29.0
## 11    11          27.5
## 12    12          39.7
  1. Which plane (tailnum) has the worst on-time record?

N844MH.

flights %>% 
  group_by(tailnum) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay))
## # A tibble: 4,044 x 2
##    tailnum avg_delay
##    <chr>       <dbl>
##  1 N844MH       320.
##  2 N911DA       294.
##  3 N922EV       276.
##  4 N587NW       264.
##  5 N851NW       219.
##  6 N928DN       201.
##  7 N7715E       188.
##  8 N654UA       185.
##  9 N665MQ       175.
## 10 N427SW       157.
## # ... with 4,034 more rows
  1. What time of day should you fly if you want to avoid delays as much as possible?

Morning, preferably before 9 am.

flights %>%
  group_by(hour) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>% 
  arrange(avg_delay)
## # A tibble: 20 x 2
##     hour avg_delay
##    <dbl>     <dbl>
##  1    7.    -5.30 
##  2    5.    -4.80 
##  3    6.    -3.38 
##  4    9.    -1.45 
##  5    8.    -1.11 
##  6   10.     0.954
##  7   11.     1.48 
##  8   12.     3.49 
##  9   13.     6.54 
## 10   14.     9.20 
## 11   23.    11.8  
## 12   15.    12.3  
## 13   16.    12.6  
## 14   18.    14.8  
## 15   22.    16.0  
## 16   17.    16.0  
## 17   19.    16.7  
## 18   20.    16.7  
## 19   21.    18.4  
## 20    1.   NaN
  1. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
flights %>% 
  group_by(dest) %>% 
  summarise(total_delay = sum(arr_delay, na.rm = TRUE))
## # A tibble: 105 x 2
##    dest  total_delay
##    <chr>       <dbl>
##  1 ABQ         1113.
##  2 ACK         1281.
##  3 ALB         6018.
##  4 ANC          -20.
##  5 ATL       190260.
##  6 AUS        14514.
##  7 AVL         2089.
##  8 BDL         2904.
##  9 BGR         2874.
## 10 BHM         4540.
## # ... with 95 more rows
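The second half of the question, each flight's share of its destination's total delay, can be sketched like this (assuming dplyr and nycflights13 are loaded; restricting to positive delays keeps the proportions between 0 and 1):

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(dest, arr_delay, prop_delay)
```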
  1. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights %>% 
  group_by(year, month, day) %>% 
  mutate(lag_delay = lag(dep_delay)) %>% 
  ggplot(aes(x = dep_delay, y = lag_delay)) +
    geom_point() +
    geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 8620 rows containing non-finite values (stat_smooth).
## Warning: Removed 8620 rows containing missing values (geom_point).

  1. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
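A sketch of one approach (assuming dplyr and nycflights13 are loaded): scale each flight's air time by the fastest flight to the same destination. Ratios at or near 1 flag candidates for data-entry errors at the low end, while the largest ratios identify the flights most delayed in the air.

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(air_time_rel = air_time / min(air_time)) %>%
  arrange(desc(air_time_rel)) %>%
  select(dest, origin, air_time, air_time_rel)
```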

  2. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

flights %>% 
  group_by(dest) %>% 
  filter(n_distinct(carrier) >= 2) %>% 
  group_by(carrier) %>% 
  summarise(n_dest = n_distinct(dest)) %>% 
  arrange(desc(n_dest))

Note that carrier is a character vector, so a condition like carrier >= 2 would compare strings with a number and keep every row; the filter must count distinct carriers per destination. Ranking carriers by how many of these shared destinations they serve puts EV (ExpressJet) at the top.
  1. For each plane, count the number of flights before the first delay of greater than 1 hour.
flights %>% 
  filter(!is.na(dep_delay)) %>% 
  group_by(tailnum) %>% 
  mutate(big_delay = cumsum(dep_delay > 60)) %>% 
  filter(big_delay < 1) %>% 
  count(sort = TRUE)

The question asks about planes (tailnum), not flight numbers, and about flights before the first long delay, so a cumulative count is needed: once cumsum(dep_delay > 60) reaches 1 for a plane, all of its later flights are excluded.

6. Workflow: scripts

6.3 Practice

  1. Go to the RStudio Tips twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!

Ctrl + 1 (focus editor) and Ctrl + 2 (focus console) are my personal favourites.

  1. What other common mistakes will RStudio diagnostics report? Read https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics to find out.

Common mistakes such as:

  • Warn if variable used has no definition in scope
  • Warn if variable is defined but not used
  • Style

Also, diagnostics for other languages such as C++, JavaScript, and Python.

7. Exploratory Data Analysis

7.3.4 Exercises

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.5)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.1) +
  coord_cartesian(ylim = c(0, 500))

diamonds %>% 
  count(cut_width(x, 0.5))
## # A tibble: 16 x 2
##    `cut_width(x, 0.5)`     n
##    <fct>               <int>
##  1 [-0.25,0.25]            8
##  2 (3.25,3.75]             3
##  3 (3.75,4.25]          1834
##  4 (4.25,4.75]         12680
##  5 (4.75,5.25]          7502
##  6 (5.25,5.75]          6448
##  7 (5.75,6.25]          6031
##  8 (6.25,6.75]          9381
##  9 (6.75,7.25]          4193
## 10 (7.25,7.75]          3437
## 11 (7.75,8.25]          1620
## 12 (8.25,8.75]           699
## 13 (8.75,9.25]            79
## 14 (9.25,9.75]            18
## 15 (9.75,10.2]             6
## 16 (10.2,10.8]             1
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

diamonds %>% 
  count(cut_width(y, 0.5))
## # A tibble: 18 x 2
##    `cut_width(y, 0.5)`     n
##    <fct>               <int>
##  1 [-0.25,0.25]            7
##  2 (3.25,3.75]             6
##  3 (3.75,4.25]          1730
##  4 (4.25,4.75]         12566
##  5 (4.75,5.25]          7556
##  6 (5.25,5.75]          6272
##  7 (5.75,6.25]          6464
##  8 (6.25,6.75]          9382
##  9 (6.75,7.25]          4176
## 10 (7.25,7.75]          3425
## 11 (7.75,8.25]          1612
## 12 (8.25,8.75]           654
## 13 (8.75,9.25]            67
## 14 (9.25,9.75]            14
## 15 (9.75,10.2]             6
## 16 (10.2,10.8]             1
## 17 (31.8,32.2]             1
## 18 (58.8,59.2]             1
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.5)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.1) +
  coord_cartesian(ylim = c(0, 50))

diamonds %>% 
  count(cut_width(z, 0.5))
## # A tibble: 16 x 2
##    `cut_width(z, 0.5)`     n
##    <fct>               <int>
##  1 [-0.25,0.25]           20
##  2 (0.75,1.25]             1
##  3 (1.25,1.75]             2
##  4 (1.75,2.25]             3
##  5 (2.25,2.75]          9276
##  6 (2.75,3.25]         13340
##  7 (3.25,3.75]          9572
##  8 (3.75,4.25]         13584
##  9 (4.25,4.75]          5589
## 10 (4.75,5.25]          2288
## 11 (5.25,5.75]           238
## 12 (5.75,6.25]            19
## 13 (6.25,6.75]             5
## 14 (6.75,7.25]             1
## 15 (7.75,8.25]             1
## 16 (31.8,32.2]             1

There are a number of outliers in each distribution that require further investigation.

As seen when you compute the bin widths and corresponding counts by hand, the most common value for x and y is between 4.25 and 4.75. For z, the most common value is between 3.75 and 4.25. The bin width from 2.75 to 3.25 is also very common. Based on these values, x and y appear to be the length and width of the diamond, while z is most likely the depth. This is confirmed in the documentation (?diamonds).

  1. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 10)

There appear to be no diamonds with a price of around $1,500, as shown by the gap in the histogram above. Such a gap is inconsistent with the otherwise smooth overall pattern.

  1. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>% 
  count(carat) %>% 
  filter(carat %in% c(0.99, 1))
## # A tibble: 2 x 2
##   carat     n
##   <dbl> <int>
## 1 0.990    23
## 2 1.00   1558

Only 23 diamonds are 0.99 carat, compared with 1,558 diamonds at exactly 1 carat.

The likely cause is rounding: sellers round 0.99-carat stones up to 1 carat, since buyers perceive (and pay for) the round number.

diamonds %>% 
  filter(carat == 0.99) %>% 
  select(carat, cut, price, x, y, z)
## # A tibble: 23 x 6
##    carat cut       price     x     y     z
##    <dbl> <ord>     <int> <dbl> <dbl> <dbl>
##  1 0.990 Fair       2811  6.21  6.06  4.18
##  2 0.990 Fair       2812  6.72  6.67  3.68
##  3 0.990 Fair       2949  6.57  6.50  3.79
##  4 0.990 Fair       3337  6.42  6.34  3.87
##  5 0.990 Fair       3593  5.94  5.80  4.20
##  6 0.990 Very Good  4002  6.44  6.49  3.90
##  7 0.990 Good       4052  6.36  6.43  4.05
##  8 0.990 Premium    4075  6.45  6.38  3.89
##  9 0.990 Ideal      4763  6.40  6.42  3.96
## 10 0.990 Very Good  4780  6.30  6.33  3.90
## # ... with 13 more rows
diamonds %>% 
  filter(carat == 1) %>% 
  select(carat, cut, price, x, y, z) %>% 
  ggplot(mapping = aes(x = cut)) +
  geom_bar()

diamonds %>% 
  filter(carat == 1) %>% 
  select(price) %>% 
  summary(price)
##      price      
##  Min.   : 1681  
##  1st Qu.: 4155  
##  Median : 4864  
##  Mean   : 5242  
##  3rd Qu.: 6073  
##  Max.   :16469
diamonds %>% 
  filter(carat == 1) %>% 
  select(carat, cut, price, x, y, z) %>% 
  ggplot(mapping = aes(x = price)) +
  geom_histogram(binwidth = 250)

  1. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  ylim(0, 50)
## Warning: Removed 11 rows containing missing values (geom_bar).

xlim() and ylim() throw away data outside the limits.

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y)) +
  coord_cartesian(ylim = c(0, 50))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y)) +
  ylim(0, 50)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing missing values (geom_bar).

If binwidth is unset, geom_histogram() defaults to 30 bins and prints a message suggesting you pick a better value with binwidth.

If you try to zoom with xlim() or ylim() so only half a bar would show, the observations outside the limits are dropped before the histogram is computed, so the bar disappears entirely rather than being clipped. coord_cartesian() zooms after the statistics are computed, so the half bar is drawn.

7.4.1 Exercises

  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

In a histogram, missing values are removed with a warning, because NA cannot be placed on a continuous axis. In a bar chart, NA is treated as just another category and gets its own bar. The difference arises because a missing value can meaningfully form a category, but has no position on a continuous scale.

  1. What does na.rm = TRUE do in mean() and sum()?

It removes missing values before the computation proceeds.
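Without it, a single NA propagates through the whole computation:

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA: the missing value poisons the result
mean(x, na.rm = TRUE)  # 2.333333
sum(x, na.rm = TRUE)   # 7
```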

7.5.1.1 Exercises

  1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(x = cancelled, y = sched_dep_time)) + 
    geom_boxplot()

  1. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

Return to this question after completing the section on modelling. If I had to guess, carat is the most important predictor of price; the two have a strong, non-linear relationship:

ggplot(diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 1/100)

  1. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

ggstance provides flipped versions of geoms, stats, and positions. This means you can provide aesthetics in their natural order, making the code more readable:

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(x = hwy, y = reorder(class, hwy, FUN = median)))

In comparison, coord_flip() can only flip the plot as a whole.

  1. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
ggplot(diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))

The boxplots show many outliers, which is typical for large sample sizes (n > 1000).

ggplot(data = diamonds) +
  geom_lv(mapping = aes(x = cut, y = price))

Using the more precise (for n > 1000) letter-value plots, far fewer points are flagged as outliers. In a letter-value plot, the widest box spans the middle half of the data (the IQR), and each successive narrower box covers half of the remaining tail.

  1. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds) +
  geom_violin(mapping = aes(x = cut, y = price))

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 1000) +
  facet_wrap(~ cut)

geom_violin() displays everything in one plot, but it shows the shape of each distribution rather than counts. In comparison, the facetted geom_histogram() shows the actual count in each bin, so it reveals, for example, that far fewer diamonds have cut == Fair than cut == Ideal, at the cost of spreading the comparison across panels.

  1. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

ggbeeswarm provides two methods to reduce overplotting: geom_quasirandom and geom_beeswarm. Both methods show the density of the data at each point, while still showing each individual point.

ggplot(iris,aes(Species, Sepal.Length)) +
  geom_point()

ggplot(iris,aes(Species, Sepal.Length)) +
  geom_jitter()

ggplot(iris, mapping = aes(Species, Sepal.Length)) +
  geom_quasirandom()

ggplot(iris, mapping = aes(Species, Sepal.Length)) +
  geom_beeswarm()

geom_quasirandom() spaces the dots using a van der Corput sequence or Tukey texturing, while geom_beeswarm() arranges them with the beeswarm package's offsetting algorithm.

7.5.2.1 Exercises

  1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>% 
  count(cut, color)
## # A tibble: 35 x 3
##    cut   color     n
##    <ord> <ord> <int>
##  1 Fair  D       163
##  2 Fair  E       224
##  3 Fair  F       312
##  4 Fair  G       314
##  5 Fair  H       303
##  6 Fair  I       175
##  7 Fair  J       119
##  8 Good  D       662
##  9 Good  E       933
## 10 Good  F       909
## # ... with 25 more rows
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

diamonds %>% 
  count(cut, color) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

  1. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
  group_by(month, dest) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = factor(month), y = dest)) +
    geom_tile(mapping = aes(fill = avg_delay))

The plot is difficult to read because there are many destinations on the y-axis. There are also missing values that could be removed to improve readability.

  1. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

Putting color on the x-axis works slightly better because color has more levels (7) than cut (5), and the wider horizontal axis accommodates more categories; the longer cut labels also remain readable on the y-axis.

8. Workflow: projects

No exercises.

10. Tibbles

10.5 Exercises

  1. How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).

The header # A tibble: appears on the first line of the output. Tibbles also print extra information, such as the column types and the total number of rows and columns, and show only the first ten rows.

as_tibble(mtcars)
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21.0    6.  160.  110.  3.90  2.62  16.5    0.    1.    4.    4.
##  2  21.0    6.  160.  110.  3.90  2.88  17.0    0.    1.    4.    4.
##  3  22.8    4.  108.   93.  3.85  2.32  18.6    1.    1.    4.    1.
##  4  21.4    6.  258.  110.  3.08  3.22  19.4    1.    0.    3.    1.
##  5  18.7    8.  360.  175.  3.15  3.44  17.0    0.    0.    3.    2.
##  6  18.1    6.  225.  105.  2.76  3.46  20.2    1.    0.    3.    1.
##  7  14.3    8.  360.  245.  3.21  3.57  15.8    0.    0.    3.    4.
##  8  24.4    4.  147.   62.  3.69  3.19  20.0    1.    0.    4.    2.
##  9  22.8    4.  141.   95.  3.92  3.15  22.9    1.    0.    4.    2.
## 10  19.2    6.  168.  123.  3.92  3.44  18.3    1.    0.    4.    4.
## # ... with 22 more rows
  1. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?

    df <- data.frame(abc = 1, xyz = "a")
    df$x
    df[, "xyz"]
    df[, c("abc", "xyz")]
df <- tibble(abc = 1, xyz = "a")
df$x
## Warning: Unknown or uninitialised column: 'x'.
## NULL
df[, "xyz"]
## # A tibble: 1 x 1
##   xyz  
##   <chr>
## 1 a
df[, c("abc", "xyz")]
## # A tibble: 1 x 2
##     abc xyz  
##   <dbl> <chr>
## 1    1. a

A data frame uses partial matching with $, so df$x silently returns the xyz column; a tibble never partial-matches and warns instead. Data frames also convert strings to factors by default (in R versions before 4.0):

df <- data.frame(abc = 1, xyz = "a")
glimpse(df)
## Observations: 1
## Variables: 2
## $ abc <dbl> 1
## $ xyz <fct> a
  1. If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?

By using the double bracket, [[. For example, df[[var]].
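For example, with the built-in mtcars data frame:

```r
var <- "mpg"
mtcars[[var]][1:3]   # 21.0 21.0 22.8
# mtcars$var would instead look for a column literally named "var"
```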

  1. Practice referring to non-syntactic names in the following data frame by:

    1. Extracting the variable called 1.

    2. Plotting a scatterplot of 1 vs 2.

    3. Creating a new column called 3 which is 2 divided by 1.

    4. Renaming the columns to one, two and three.

    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )

Extracting the variable called 1:

annoying$`1`
##  [1]  1  2  3  4  5  6  7  8  9 10

Plotting a scatterplot of 1 vs 2:

ggplot(annoying, aes(`1`, `2`)) +
  geom_point()

Creating a new column called 3 which is 2 divided by 1:

annoying <- annoying %>%
  mutate(`3` = `2` / `1`)

Renaming the columns to one, two and three:

annoying <- annoying %>%
  rename(`one` = `1`,
         `two` = `2`,
         `three` = `3`)
  1. What does tibble::enframe() do? When might you use it?

It converts atomic vectors or lists to two-column data frames.
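It is useful when a summary comes back as a named vector and you want it as a tibble for further dplyr or ggplot2 work. A sketch, assuming the tibble package is loaded:

```r
library(tibble)

x <- c(a = 1, b = 2, c = 3)
# Produces a two-column tibble with columns `name` and `value`;
# deframe() performs the reverse operation.
enframe(x)
```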

  1. What option controls how many additional column names are printed at the footer of a tibble?

The n_extra argument of print.tbl(); the corresponding global option is tibble.max_extra_cols. For example:

print(flights, n_extra = 1)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   …

11. Data import

11.2.2 Exercises

  1. What function would you use to read a file where fields were separated with
    “|”?

read_delim(). Set its delim argument to “|” to read pipe-separated fields.
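For example, readr treats a string containing a newline as literal data, so a quick sketch is:

read_delim("x|y\n1|2\n3|4", delim = "|")

This returns a two-column data frame with columns x and y.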

  1. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

Some other arguments they have in common: col_names, col_types, locale, and na. In fact, it appears they share the exact same arguments.

  1. What are the most important arguments to read_fwf()?

The most important argument is col_positions, which describes where each field begins and ends. It is usually built with fwf_widths() (which takes field widths) or fwf_positions() (which takes start and end positions).
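A minimal sketch, using fwf_widths() to supply the field widths and column names:

read_fwf("ab123\ncd456", fwf_widths(c(2, 3), c("x", "y")))

This parses x as "ab" and "cd", and y as 123 and 456.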

  1. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?

    "x,y\n1,'a,b'"
x <- "x,y\n1,'a,b'"
df <- read_delim(x, ",", quote = "'")

  1. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")

In the first two files, the number of column headers doesn’t match the number of fields in the rows; in the third, a closing quote is missing; in the fourth, the character values in the last row force both columns to character; in the fifth, the separator is “;” rather than “,”. Let’s fix these:

read_csv("a,b,c\n1,2,3\n4,5,6")
## # A tibble: 2 x 3
##       a     b     c
##   <int> <int> <int>
## 1     1     2     3
## 2     4     5     6
read_csv("a,b,c\n1,2,1\n2,3,4")
## # A tibble: 2 x 3
##       a     b     c
##   <int> <int> <int>
## 1     1     2     1
## 2     2     3     4
read_csv("a,b\n1")
## # A tibble: 1 x 2
##       a b    
##   <int> <chr>
## 1     1 <NA>

It’s unclear what the intent was here:

read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
##   a     b    
##   <chr> <chr>
## 1 1     2    
## 2 a     b
read_csv("a,b\n1,3")
## # A tibble: 1 x 2
##       a     b
##   <int> <int>
## 1     1     3

11.3.5 Exercises

  1. What are the most important arguments to locale()?

date_names, date_format, and time_format for parsing dates, and decimal_mark and grouping_mark for parsing numbers. The tz and encoding arguments can also matter, depending on the file.

  1. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
locale(decimal_mark = ",", grouping_mark = ",")
## Error: `decimal_mark` and `grouping_mark` must be different

It produces an error.

locale(decimal_mark = ",")
## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

The default value for grouping_mark is changed to “.”.

locale(grouping_mark = ".")
## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

The default value for decimal_mark is changed to “,”.

  1. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.

date_format and time_format allow you to specify the default date and time formats.

str(parse_guess("01/13/2018", locale = locale(date_format = "%m/%d/%Y")))
##  Date[1:1], format: "2018-01-13"

According to vignette("readr"), time_format isn’t currently used for anything.

  1. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
locale(date_names = "en", date_format = "%d/%m/%Y", decimal_mark = ".",
       grouping_mark = ",", tz = "Australia/Perth")
## <locale>
## Numbers:  123,456.78
## Formats:  %d/%m/%Y / %AT
## Timezone: Australia/Perth
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

  1. What’s the difference between read_csv() and read_csv2()?

read_csv2() uses “;” (i.e., a semi-colon) as the field separator instead of “,” (a comma). It also treats “,” as the decimal mark, a convention common in many European countries.
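For example, values using the European decimal convention parse correctly:

read_csv2("a;b\n1,5;2,5")

Both columns are read as the doubles 1.5 and 2.5.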

  1. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

UTF-8 is the standard worldwide. In Europe, ISO 8859-1 is a common encoding. Shift JIS is a common encoding in Asia.

  1. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
str(parse_date(d1, format = "%B %d, %Y"))
##  Date[1:1], format: "2010-01-01"
str(parse_date(d2, format = "%Y-%b-%d"))
##  Date[1:1], format: "2015-03-07"
str(parse_date(d3, format = "%d-%b-%Y"))
##  Date[1:1], format: "2017-06-06"
str(parse_date(d4, format = "%B %d (%Y)"))
##  Date[1:2], format: "2015-08-19" "2015-07-01"
str(parse_date(d5, format = "%m/%d/%y"))
##  Date[1:1], format: "2014-12-30"
str(parse_time(t1, format = "%H%M"))
## Classes 'hms', 'difftime'  atomic [1:1] 61500
##   ..- attr(*, "units")= chr "secs"
str(parse_time(t2, format = "%H:%M:%OS %p"))
## Classes 'hms', 'difftime'  atomic [1:1] 83710
##   ..- attr(*, "units")= chr "secs"

12. Tidy data

12.2.1 Exercises

  1. Using prose, describe how the variables and observations are organised in each of the sample tables.

In table1, each variable has its own column, each observation its own row, and each value its own cell. Therefore, table1 is a tidy dataset.

In table2 the cases and population do not have their own columns; instead, they are treated as values of the type variable. table3 does not store the cases and population variables separately: both are combined into a single calculation under rate. table4a stores the values of the year variable as separate variables (with their own columns), with the values being the number of cases (population is missing). table4b is similar, but uses population for the values rather than cases.

  1. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

    1. Extract the number of TB cases per country per year.
    2. Extract the matching population per country per year.
    3. Divide cases by population, and multiply by 10000.
    4. Store back in the appropriate place.

    Which representation is easiest to work with? Which is hardest? Why?

Extract the number of TB cases per country per year:

table2_cases <- filter(table2, type == "cases")[["count"]]

Extract the matching population per country per year:

table2_population <- filter(table2, type == "population")[["count"]]

Divide cases by population, and multiply by 10000:

table2_rate <- table2_cases / table2_population * 10000

Store back in the appropriate place:

table2_country <- filter(table2, type == "cases")[["country"]]
table2_year <- filter(table2, type == "cases")[["year"]]

table2_new <- tibble(
  country = table2_country,
  year = table2_year,
  cases = table2_cases,
  population = table2_population,
  rate = table2_rate
  )
table2_new
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67

For table4a and table4b:

table4_new <- tibble(
  country = table4a[["country"]],
  `1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
  `2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
  )
table4_new
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan  0.373   1.29
## 2 Brazil       2.19    4.61
## 3 China        1.67    1.67

  1. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

Filter cases:

table2 %>%
  filter(type == "cases") %>% 
  ggplot(aes(year, count, group = country, colour = country)) +
    geom_line() +
    geom_point()

12.3.3 Exercises

  1. Why are gather() and spread() not perfectly symmetrical?
    Carefully consider the following example:

    stocks <- tibble(
      year   = c(2015, 2015, 2016, 2016),
      half  = c(   1,    2,     1,    2),
      return = c(1.88, 0.59, 0.92, 0.17)
    )
    stocks %>% 
      spread(year, return) %>% 
      gather("year", "return", `2015`:`2016`)

    (Hint: look at the variable types and think about column names.)

    Both spread() and gather() have a convert argument. What does it do?

gather() and spread() are not perfectly symmetrical because variable type is not retained when using both functions, as seen in the example above. In stocks, year is numeric:

stocks %>% select(year)
## # A tibble: 4 x 1
##    year
##   <dbl>
## 1 2015.
## 2 2015.
## 3 2016.
## 4 2016.

However, after using the spread() and gather() functions it becomes character:

stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`) %>% 
  select(year)
## # A tibble: 4 x 1
##   year 
##   <chr>
## 1 2015 
## 2 2015 
## 3 2016 
## 4 2016

The convert argument, when set to TRUE, runs type.convert() on the new columns, so character columns produced by spread() or gather() are converted back to numeric, integer, or logical where appropriate.

  1. Why does this code fail?

    table4a %>% 
      gather(1999, 2000, key = "year", value = "cases")
    ## Error in inds_combine(.vars, ind_list): Position must be between 0 and n

“1999” and “2000” are non-syntactic names so they have to be surrounded by backticks.

table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

  1. Why does spreading this tibble fail? How could you add a new column to fix the problem?

    people <- tribble(
      ~name,             ~key,    ~value,
      #-----------------|--------|------
      "Phillip Woods",   "age",       45,
      "Phillip Woods",   "height",   186,
      "Phillip Woods",   "age",       50,
      "Jessica Cordero", "age",       37,
      "Jessica Cordero", "height",   156
    )
people %>% spread(key, value)
## Error: Duplicate identifiers for rows (1, 3)

Phillip Woods’ age has been entered twice. You could add a new column that indicates the time of observation:

people <- tribble(
  ~name,             ~key,    ~value, ~time,
  #-----------------|--------|------|-------
  "Phillip Woods",   "age",       45, 1,
  "Phillip Woods",   "height",   186, 1,
  "Phillip Woods",   "age",       50, 2,
  "Jessica Cordero", "age",       37, 1,
  "Jessica Cordero", "height",   156, 1
)
people %>% spread(key, value)
## # A tibble: 3 x 4
##   name             time   age height
##   <chr>           <dbl> <dbl>  <dbl>
## 1 Jessica Cordero    1.   37.   156.
## 2 Phillip Woods      1.   45.   186.
## 3 Phillip Woods      2.   50.    NA

  1. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?

    preg <- tribble(
      ~pregnant, ~male, ~female,
      "yes",     NA,    10,
      "no",      20,    12
    )

It needs to be gathered. The variables are sex (male or female) and pregnant (yes or no).

preg %>% gather(
  male, female,
  key = "sex",
  value = "count"
)
## # A tibble: 4 x 3
##   pregnant sex    count
##   <chr>    <chr>  <dbl>
## 1 yes      male     NA 
## 2 no       male     20.
## 3 yes      female   10.
## 4 no       female   12.

12.4.3 Exercises

  1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
      separate(x, c("one", "two", "three"))
    
    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
      separate(x, c("one", "two", "three"))

extra controls what happens when you use a character vector in sep and there are too many pieces.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f,g  
## 3 h     i     j

In comparison, fill controls what happens when there are not enough pieces.

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "left")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 <NA>  d     e    
## 3 f     g     i

  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

It removes the input column from the data frame. By setting this to FALSE, you can retain the original input column in the dataset:

tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>% 
  separate(x, c("one", "two", "three"), remove = FALSE)
## # A tibble: 3 x 4
##   x     one   two   three
##   <chr> <chr> <chr> <chr>
## 1 a,b,c a     b     c    
## 2 d,e,f d     e     f    
## 3 g,h,i g     h     i

  1. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>% 
  separate(x, c("one", "two", "three"))
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 g     h     i
tibble(x = c("a,b,c", "d,e,f", "g,h,i"))
## # A tibble: 3 x 1
##   x    
##   <chr>
## 1 a,b,c
## 2 d,e,f
## 3 g,h,i
tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>% 
  extract(x, "A")
## # A tibble: 3 x 1
##   A    
##   <chr>
## 1 a    
## 2 d    
## 3 g

separate() pulls apart one column and places the values into many columns. extract() is similar, but uses a regular expression with capturing groups to determine which pieces of each value become the new columns.

unite() only requires one variation of separation because it is combining multiple columns into one column. Therefore, there is only one way to do it (using sep). In comparison, there are multiple ways to split a character string for separate() (e.g., by position or by separator).

12.5.1 Exercises

  1. Compare and contrast the fill arguments to spread() and complete().

In spread(), fill is a single value used to fill every missing cell (both explicit and implicit missing values). In complete(), fill takes a named list, so you can supply a different replacement value for each column.
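A small sketch with made-up data illustrating the difference:

library(tidyverse)

df <- tibble(
  group = c("a", "a", "b"),
  item  = c(1, 2, 2),
  value = c(10, 20, 30)
)

# spread(): fill is a single value used for every missing cell
df %>% spread(item, value, fill = 0)

# complete(): fill is a named list, one replacement value per column
df %>% complete(group, item, fill = list(value = 0))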

  1. What does the direction argument to fill() do?

It is the direction in which to fill missing values. For example:

df <- data.frame(Month = 1:12, Year = c(NA, 2000, rep(NA, 10)))
df %>% fill(Year, .direction = "up")
##    Month Year
## 1      1 2000
## 2      2 2000
## 3      3   NA
## 4      4   NA
## 5      5   NA
## 6      6   NA
## 7      7   NA
## 8      8   NA
## 9      9   NA
## 10    10   NA
## 11    11   NA
## 12    12   NA

12.6.1 Exercises

  1. In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

I think this is reasonable. Missing values appear to be represented explicitly in this dataset:

who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases") %>% 
  count(is.na(cases))
## # A tibble: 2 x 2
##   `is.na(cases)`      n
##   <lgl>           <int>
## 1 FALSE           76046
## 2 TRUE           329394

There are about 329,000 NA values for cases (76,046 of the 405,440 values are non-missing).

It is difficult to determine whether there are implicit missing values in a dataset because, as stated earlier in the chapter, implicit missing values are an “absence of a presence”.

NA indicates an explicit missing value. So, the number of actual cases for an NA observation is not necessarily zero. In comparison, a value of zero indicates there were no cases for a particular observation.

who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases") %>% 
  filter(cases == 0) %>% 
  nrow()
## [1] 11080

  1. What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows
## [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476,
## 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
## # A tibble: 76,046 x 6
##    country      year var   sex   age   value
##  * <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # ... with 76,036 more rows

It won’t split the string newrel properly.

split_problem <- who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows
## [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476,
## 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
split_problem %>% slice(73467:73486)
## # A tibble: 20 x 6
##    country              year var   sex   age   value
##    <chr>               <int> <chr> <chr> <chr> <int>
##  1 Afghanistan          2013 m014  <NA>  <NA>   1705
##  2 Albania              2013 m014  <NA>  <NA>     14
##  3 Algeria              2013 m014  <NA>  <NA>     25
##  4 Andorra              2013 m014  <NA>  <NA>      0
##  5 Angola               2013 m014  <NA>  <NA>    486
##  6 Anguilla             2013 m014  <NA>  <NA>      0
##  7 Antigua and Barbuda  2013 m014  <NA>  <NA>      1
##  8 Argentina            2013 m014  <NA>  <NA>    462
##  9 Armenia              2013 m014  <NA>  <NA>     25
## 10 Australia            2013 m014  <NA>  <NA>     28
## 11 Austria              2013 m014  <NA>  <NA>     15
## 12 Azerbaijan           2013 m014  <NA>  <NA>    125
## 13 Bahamas              2013 m014  <NA>  <NA>      3
## 14 Bahrain              2013 m014  <NA>  <NA>      4
## 15 Bangladesh           2013 m014  <NA>  <NA>   2323
## 16 Barbados             2013 m014  <NA>  <NA>      0
## 17 Belarus              2013 m014  <NA>  <NA>      4
## 18 Belgium              2013 m014  <NA>  <NA>     40
## 19 Belize               2013 m014  <NA>  <NA>      1
## 20 Benin                2013 m014  <NA>  <NA>     25

For relapse cases, sex and age group have not been separated (they remain combined under var), and the sex and age variables are NA.

  1. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

For each country, there is only one unique combination of iso2 and iso3:

who %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>% 
  select(country, iso2, iso3) %>% 
  distinct() %>% 
  group_by(country) %>% 
  filter(n() > 1)
## # A tibble: 0 x 3
## # Groups:   country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>

  1. For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.
who_cases <- who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1) %>% 
  group_by(country, year, sex) %>% 
  summarise(cases = sum(value))

who_cases
## # A tibble: 6,921 x 4
## # Groups:   country, year [?]
##    country      year sex   cases
##    <chr>       <int> <chr> <int>
##  1 Afghanistan  1997 f       102
##  2 Afghanistan  1997 m        26
##  3 Afghanistan  1998 f      1207
##  4 Afghanistan  1998 m       571
##  5 Afghanistan  1999 f       517
##  6 Afghanistan  1999 m       228
##  7 Afghanistan  2000 f      1751
##  8 Afghanistan  2000 m       915
##  9 Afghanistan  2001 f      3062
## 10 Afghanistan  2001 m      1577
## # ... with 6,911 more rows
who_cases %>% 
  group_by(sex, year) %>%
  filter(year > 1995) %>% 
  summarise(cases_total = sum(cases)) %>%
  ggplot(aes(x = year, y = cases_total, group = sex, colour = sex)) +
    geom_line()

13. Relational data

13.2.1 Exercises

  1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

From flights, you would need origin and dest. From airports, lat, lon (the locations of the airport), and maybe alt would be required. You would need to combine the flights and airports tables to get the origin airport and destination airport.

The weather table may also be helpful as wind speed, wind direction, and visibility may affect the route.

  1. I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

weather connects to airports via origin (the location). In the diagram, there should be an arrow pointing from faa in the airports table to origin in the weather table.

  1. weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

In addition to origin, it would define a relation with dest in flights: each flight could then be matched to the weather at its destination via year, month, day, hour, and the airport code.

  1. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

You would have date and name variables. The primary key of this table would be date. It would match to year, month, and day in other tables such as flights.
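For example (dates and holiday names invented for illustration), storing the date as separate year, month, and day columns makes the join straightforward:

special_days <- tribble(
  ~year, ~month, ~day, ~name,
   2013,      1,    1, "New Year's Day",
   2013,     12,   25, "Christmas Day"
)

flights %>%
  left_join(special_days, by = c("year", "month", "day"))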

13.3.1 Exercises

  1. Add a surrogate key to flights.
flights %>% 
  arrange(year, month, day, sched_dep_time) %>% 
  mutate(flight_id = row_number())
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            558       -4.      740
##  6  2013     1     1      559            559        0.      702
##  7  2013     1     1      554            600       -6.      812
##  8  2013     1     1      555            600       -5.      913
##  9  2013     1     1      557            600       -3.      709
## 10  2013     1     1      557            600       -3.      838
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, flight_id <int>

  1. Identify the keys in the following datasets

    1. Lahman::Batting,
    2. babynames::babynames
    3. nasaweather::atmos
    4. fueleconomy::vehicles
    5. ggplot2::diamonds

    (You might need to install some packages and read some documentation.)

Lahman::Batting:

The primary keys for Lahman::Batting are playerID, yearID, and stint.

Lahman::Batting %>% 
  count(playerID, yearID, stint) %>% 
  filter(n > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: playerID <chr>, yearID <int>, stint <int>, n <int>

playerID is a foreign key (it is the primary key for Lahman::Master).

babynames::babynames:

The primary keys for babynames::babynames are year, sex, and name.

babynames::babynames %>%
  group_by(year, sex, name) %>%
  filter(n() > 1)
## # A tibble: 0 x 5
## # Groups:   year, sex, name [0]
## # ... with 5 variables: year <dbl>, sex <chr>, name <chr>, n <int>,
## #   prop <dbl>

nasaweather::atmos:

For nasaweather::atmos, the primary keys are lat, long, year, and month.

nasaweather::atmos %>% 
  group_by(lat, long, year, month) %>% 
  filter(n() > 1)
## # A tibble: 0 x 11
## # Groups:   lat, long, year, month [0]
## # ... with 11 variables: lat <dbl>, long <dbl>, year <int>, month <int>,
## #   surftemp <dbl>, temp <dbl>, pressure <dbl>, ozone <dbl>,
## #   cloudlow <dbl>, cloudmid <dbl>, cloudhigh <dbl>

fueleconomy::vehicles:

The primary key for fueleconomy::vehicles is id.

fueleconomy::vehicles %>% 
  count(id) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>

ggplot2::diamonds:

ggplot2::diamonds does not appear to have a primary key.

ggplot2::diamonds %>% 
  group_by(carat, cut, color, clarity, depth, table, price, x, y, z) %>% 
  filter(n() > 1) %>% 
  nrow()
## [1] 289

  1. Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers.

    How would you characterise the relationship between the Batting, Pitching, and Fielding tables?

13.4.6 Exercises

  1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

    airports %>%
      semi_join(flights, c("faa" = "dest")) %>%
      ggplot(aes(lon, lat)) +
        borders("state") +
        geom_point() +
        coord_quickmap()

    (Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)

    You might want to use the size or colour of the points to display the average delay for each airport.

dest_delay <- flights %>% 
  group_by(dest) %>% 
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE))

dest_delay %>% 
  left_join(airports, c("dest" = "faa")) %>% 
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(colour = avg_delay)) +
    coord_quickmap()
## Warning: Removed 4 rows containing missing values (geom_point).

  1. Add the location of the origin and destination (i.e. the lat and lon) to flights.
flights %>% 
  left_join(airports, c("origin" = "faa")) %>% 
  left_join(airports, c("dest" = "faa")) %>% 
  rename(lat_origin = lat.x, lon_origin = lon.x) %>%
  rename(lat_dest = lat.y, lon_dest = lon.y) %>% 
  select(-contains("."))
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 16 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, lat_origin <dbl>, lon_origin <dbl>,
## #   lat_dest <dbl>, lon_dest <dbl>

  1. Is there a relationship between the age of a plane and its delays?
planes_avg_delay <- planes %>% 
  select(tailnum, year) %>% 
  left_join(flights, "tailnum") %>% 
  group_by(tailnum, year.x) %>%
  rename(year = year.x) %>% 
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE))

ggplot(planes_avg_delay, aes(year, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 76 rows containing missing values (geom_point).

It appears not: there is no obvious relationship between a plane’s year of manufacture and its average delay.

  1. What weather conditions make it more likely to see a delay?

I think rain and low visibility would increase the likelihood of delay.

flights_weather <- flights %>% 
  inner_join(weather, c("origin", "year", "month", "day", "hour"))

flights_weather %>% 
  group_by(visib) %>% 
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  ggplot(aes(visib, avg_delay)) + 
    geom_point() +
    geom_line() +
    geom_smooth(method = lm, se = FALSE)

As visibility increases, the average delay decreases.

flights_weather %>% 
  group_by(precip) %>% 
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  ggplot(aes(precip, avg_delay)) + 
    geom_point() +
    geom_line() +
    geom_smooth(method = lm, se = FALSE)

The relationship between rainfall and average delay is less clear.

  1. What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

A series of large storms hit the south-east of the USA.

flights %>% 
  filter(month == 6, day == 13) %>% 
  group_by(hour) %>% 
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 19 x 2
##     hour avg_delay
##    <dbl>     <dbl>
##  1    5.     -2.83
##  2    6.      3.54
##  3    7.      2.81
##  4    8.      7.91
##  5    9.     23.2 
##  6   10.     27.8 
##  7   11.     35.6 
##  8   12.     61.3 
##  9   13.     51.8 
## 10   14.     61.5 
## 11   15.     59.2 
## 12   16.     52.0 
## 13   17.     63.8 
## 14   18.     70.0 
## 15   19.     92.9 
## 16   20.     87.9 
## 17   21.     83.3 
## 18   22.    103.  
## 19   23.     25.3

By mid-morning there were significant delays. This continued into the evening.

flights %>% 
  filter(month == 6, day == 13) %>% 
  group_by(dest) %>% 
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  inner_join(airports, by = c("dest" = "faa")) %>%
    ggplot(aes(lon, lat)) +
      borders("state") +
      geom_point(aes(colour = avg_delay, size = avg_delay)) +
      coord_quickmap()
## Warning: Removed 3 rows containing missing values (geom_point).

13.5.1 Exercises

  1. What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
flights %>% 
  anti_join(planes, by = "tailnum")
## # A tibble: 52,606 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600       -2.      753
##  2  2013     1     1      559            600       -1.      941
##  3  2013     1     1      600            600        0.      837
##  4  2013     1     1      602            605       -3.      821
##  5  2013     1     1      608            600        8.      807
##  6  2013     1     1      611            600       11.      945
##  7  2013     1     1      623            610       13.      920
##  8  2013     1     1      624            630       -6.      840
##  9  2013     1     1      628            630       -2.     1137
## 10  2013     1     1      629            630       -1.      824
## # ... with 52,596 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Many of the flights without a tail number in planes appear to be American Airlines (AA) or Envoy Air (MQ).

flights %>% 
  anti_join(planes, by = "tailnum") %>% 
  group_by(carrier) %>% 
  summarise(count = n())
## # A tibble: 10 x 2
##    carrier count
##    <chr>   <int>
##  1 9E       1044
##  2 AA      22558
##  3 B6        830
##  4 DL        110
##  5 F9         50
##  6 FL        187
##  7 MQ      25397
##  8 UA       1693
##  9 US        699
## 10 WN         38
flights %>% 
  anti_join(planes, by = "tailnum") %>% 
  nrow()
## [1] 52606

The variable that explains the problem is carrier: together, AA and MQ account for about 90% of the flights without a matching record in planes.
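This can be checked with quick arithmetic on the counts printed above:

```r
# Share of the unmatched flights flown by AA or MQ, using the counts above:
(22558 + 25397) / 52606  # roughly 0.91, i.e. ~90% of the problem flights
```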

  1. Filter flights to only show flights with planes that have flown at least 100 flights.
flights_100 <- flights %>% 
  group_by(tailnum) %>% 
  count() %>% 
  filter(n >= 100, !is.na(tailnum)) %>% 
  arrange(desc(n))

flights %>% 
  semi_join(flights_100, by = "tailnum")
## # A tibble: 228,390 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      544            545       -1.     1004
##  4  2013     1     1      554            558       -4.      740
##  5  2013     1     1      555            600       -5.      913
##  6  2013     1     1      557            600       -3.      709
##  7  2013     1     1      557            600       -3.      838
##  8  2013     1     1      558            600       -2.      849
##  9  2013     1     1      558            600       -2.      853
## 10  2013     1     1      558            600       -2.      923
## # ... with 228,380 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  1. Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.
fueleconomy::vehicles %>% 
  semi_join(fueleconomy::common, by = c("make", "model"))
## # A tibble: 14,531 x 12
##       id make  model  year class trans drive   cyl displ fuel    hwy   cty
##    <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
##  1  1833 Acura Inte…  1986 Subc… Auto… Fron…     4  1.60 Regu…    28    22
##  2  1834 Acura Inte…  1986 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  3  3037 Acura Inte…  1987 Subc… Auto… Fron…     4  1.60 Regu…    28    22
##  4  3038 Acura Inte…  1987 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  5  4183 Acura Inte…  1988 Subc… Auto… Fron…     4  1.60 Regu…    27    22
##  6  4184 Acura Inte…  1988 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  7  5303 Acura Inte…  1989 Subc… Auto… Fron…     4  1.60 Regu…    27    22
##  8  5304 Acura Inte…  1989 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  9  6442 Acura Inte…  1990 Subc… Auto… Fron…     4  1.80 Regu…    24    20
## 10  6443 Acura Inte…  1990 Subc… Manu… Fron…     4  1.80 Regu…    26    21
## # ... with 14,521 more rows
  1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

  2. What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      544            545       -1.     1004
##  2  2013     1     1      615            615        0.     1039
##  3  2013     1     1      628            630       -2.     1137
##  4  2013     1     1      701            700        1.     1123
##  5  2013     1     1      711            715       -4.     1151
##  6  2013     1     1      820            820        0.     1254
##  7  2013     1     1      820            820        0.     1249
##  8  2013     1     1      840            845       -5.     1311
##  9  2013     1     1      909            810       59.     1331
## 10  2013     1     1      913            918       -5.     1346
## # ... with 7,592 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The destinations of these flights do not appear in airports. Since airports only contains FAA airports, these are likely flights to airports outside the mainland US, such as those in Puerto Rico and the US Virgin Islands.

anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044   -5. A     America/New_…
##  2 06A   Moton Field Municip…  32.5  -85.7   264   -6. A     America/Chic…
##  3 06C   Schaumburg Regional   42.0  -88.1   801   -6. A     America/Chic…
##  4 06N   Randall Airport       41.4  -74.4   523   -5. A     America/New_…
##  5 09J   Jekyll Island Airpo…  31.1  -81.4    11   -5. A     America/New_…
##  6 0A9   Elizabethton Munici…  36.4  -82.2  1593   -5. A     America/New_…
##  7 0G6   Williams County Air…  41.5  -84.5   730   -5. A     America/New_…
##  8 0G7   Finger Lakes Region…  42.9  -76.8   492   -5. A     America/New_…
##  9 0P2   Shoestring Aviation…  39.8  -76.6  1000   -5. U     America/New_…
## 10 0S9   Jefferson County In…  48.1 -123.    108   -8. A     America/Los_…
## # ... with 1,347 more rows

These are the airports that were not the destination of any flight departing NYC in 2013.

  1. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.
flights_carrier <- flights %>% 
  group_by(tailnum, carrier) %>% 
  count() %>% 
  arrange(tailnum)
flights_carrier
## # A tibble: 4,067 x 3
## # Groups:   tailnum, carrier [4,067]
##    tailnum carrier     n
##    <chr>   <chr>   <int>
##  1 D942DN  DL          4
##  2 N0EGMQ  MQ        371
##  3 N10156  EV        153
##  4 N102UW  US         48
##  5 N103US  US         46
##  6 N104UW  US         47
##  7 N10575  EV        289
##  8 N105UW  US         45
##  9 N107US  US         41
## 10 N108UW  US         60
## # ... with 4,057 more rows
flights_carrier %>% 
  group_by(tailnum) %>% 
  count() %>% 
  filter(nn > 1)
## # A tibble: 18 x 2
## # Groups:   tailnum [18]
##    tailnum    nn
##    <chr>   <int>
##  1 N146PQ      2
##  2 N153PQ      2
##  3 N176PQ      2
##  4 N181PQ      2
##  5 N197PQ      2
##  6 N200PQ      2
##  7 N228PQ      2
##  8 N232PQ      2
##  9 N933AT      2
## 10 N935AT      2
## 11 N977AT      2
## 12 N978AT      2
## 13 N979AT      2
## 14 N981AT      2
## 15 N989AT      2
## 16 N990AT      2
## 17 N994AT      2
## 18 <NA>        7

A small number of planes have been flown by multiple airlines, but the vast majority have been flown by a single airline. The hypothesis is therefore mostly, but not strictly, true.
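To see which carriers share those planes, the flights_carrier table from above can be filtered down to tail numbers associated with more than one carrier (a sketch):

```r
flights_carrier %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  filter(n_distinct(carrier) > 1)
```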

14. Strings

To come…

15. Factors

15.3.1 Exercises

  1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
summary(gss_cat$rincome)
##      No answer     Don't know        Refused $25000 or more $20000 - 24999 
##            183            267            975           7363           1283 
## $15000 - 19999 $10000 - 14999  $8000 to 9999  $7000 to 7999  $6000 to 6999 
##           1048           1168            340            188            215 
##  $5000 to 5999  $4000 to 4999  $3000 to 3999  $1000 to 2999       Lt $1000 
##            227            226            276            395            286 
## Not applicable 
##           7043
ggplot(gss_cat, aes(rincome)) +
  geom_bar()

This plot can be improved by rotating the labels on the x-axis:

ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

It would also be better if the categories on the x-axis were ordered from smallest to largest. To do this, you would have to modify the order of the factor levels. The following categories can also be combined: “No answer”, “Don’t know”, “Refused”, and “Not applicable”.
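Another possible improvement is to flip the coordinates so the long income labels run horizontally (a sketch):

```r
ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  coord_flip()
```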

  1. What is the most common relig in this survey? What’s the most common partyid?
gss_cat %>% 
  count(relig) %>% 
  arrange(desc(n)) %>% 
  head(1)
## # A tibble: 1 x 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
gss_cat %>% 
  count(partyid) %>% 
  arrange(desc(n)) %>% 
  head(1)
## # A tibble: 1 x 2
##   partyid         n
##   <fct>       <int>
## 1 Independent  4119

“Protestant” is the most common religion, and “Independent” is the most common party affiliation.

  1. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
levels(gss_cat$denom)
##  [1] "No answer"            "Don't know"           "No denomination"     
##  [4] "Other"                "Episcopal"            "Presbyterian-dk wh"  
##  [7] "Presbyterian, merged" "Other presbyterian"   "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which"    "Evangelical luth"    
## [13] "Other lutheran"       "Wi evan luth synod"   "Lutheran-mo synod"   
## [16] "Luth ch in america"   "Am lutheran"          "Methodist-dk which"  
## [19] "Other methodist"      "United methodist"     "Afr meth ep zion"    
## [22] "Afr meth episcopal"   "Baptist-dk which"     "Other baptists"      
## [25] "Southern baptist"     "Nat bapt conv usa"    "Nat bapt conv of am" 
## [28] "Am bapt ch in usa"    "Am baptist asso"      "Not applicable"
no_denom <- gss_cat %>%
  filter(!denom %in% c("No answer", "Don't know", "No denomination", "Not applicable", "Refused"))
no_denom %>% 
  count(relig)
## # A tibble: 1 x 2
##   relig          n
##   <fct>      <int>
## 1 Protestant  9559
ggplot(no_denom, aes(relig, denom)) +
  geom_count(aes(colour = ..n.., size = ..n..)) +
  guides(colour = 'legend')

Denomination applies to “Protestant”.

15.4.1 Exercises

  1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?
summary(gss_cat$tvhours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   2.000   2.981   4.000  24.000   10146
ggplot(gss_cat, aes(tvhours)) +
  geom_bar()
## Warning: Removed 10146 rows containing non-finite values (stat_count).

gss_cat %>% 
  count(tvhours) %>% 
  arrange(desc(tvhours))
## # A tibble: 25 x 2
##    tvhours     n
##      <int> <int>
##  1      24    22
##  2      23     1
##  3      22     2
##  4      21     2
##  5      20    14
##  6      18     7
##  7      17     2
##  8      16    10
##  9      15    17
## 10      14    24
## # ... with 15 more rows

It is unlikely that people are watching 24 hours of TV per day. As seen in the bar plot, tvhours is skewed to the right due to the presence of these extreme values. The median would be a better summary of the data because, unlike the mean, it is largely unaffected by outliers.
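The median can be read off directly, and matches the five-number summary above:

```r
median(gss_cat$tvhours, na.rm = TRUE)
## [1] 2
```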

  1. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
glimpse(gss_cat)
## Observations: 21,483
## Variables: 9
## $ year    <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divor...
## $ age     <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51...
## $ race    <fct> White, White, White, White, White, White, White, White...
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not appl...
## $ partyid <fct> Ind,near rep, Not str republican, Independent, Ind,nea...
## $ relig   <fct> Protestant, Protestant, Protestant, Orthodox-christian...
## $ denom   <fct> Southern baptist, Baptist-dk which, No denomination, N...
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, ...
levels(gss_cat$marital)
## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"
levels(gss_cat$rincome)
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

marital, race, partyid, relig, and denom are arbitrary (i.e., nominal). rincome is principled (i.e., ordinal).

  1. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()

Because fct_relevel() moves “Not applicable” to the front of the level order (integer value 1), and ggplot2 draws the first level at the bottom of a discrete y-axis:

rincome_levels <- fct_relevel(rincome_summary$rincome, "Not applicable")
levels(rincome_levels)
##  [1] "Not applicable" "No answer"      "Don't know"     "Refused"       
##  [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
##  [9] "$8000 to 9999"  "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999" 
## [13] "$4000 to 4999"  "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"

15.5.1 Exercises

  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
partyid_clean <- gss_cat %>%
  mutate(partyid = fct_collapse(
    partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  ))

partyid_counts <- partyid_clean %>% 
  count(year, partyid) %>% 
  group_by(year) %>% 
  mutate(prop = n / sum(n))

ggplot(partyid_counts, aes(x = year, y = prop, colour = fct_reorder2(
  partyid, year, prop
))) +
  geom_point() +
  geom_line() +
  labs(colour = "partyid")

The proportion of people identifying as Republican has declined over time, the proportion identifying as Independent has been increasing, and the proportion identifying as Democrat has stayed about the same.

There has also been a slight increase in the proportion identifying as Other since around 2005.

  1. How could you collapse rincome into a small set of categories?
gss_cat %>% 
  count(rincome)
## # A tibble: 16 x 2
##    rincome            n
##    <fct>          <int>
##  1 No answer        183
##  2 Don't know       267
##  3 Refused          975
##  4 $25000 or more  7363
##  5 $20000 - 24999  1283
##  6 $15000 - 19999  1048
##  7 $10000 - 14999  1168
##  8 $8000 to 9999    340
##  9 $7000 to 7999    188
## 10 $6000 to 6999    215
## 11 $5000 to 5999    227
## 12 $4000 to 4999    226
## 13 $3000 to 3999    276
## 14 $1000 to 2999    395
## 15 Lt $1000         286
## 16 Not applicable  7043
gss_cat %>%
  mutate(rincome = fct_lump(rincome, n = 5)) %>%
  count(rincome, sort = TRUE) %>%
  mutate(rincome = fct_recode(
    rincome,
    "Less than $10000" = "Other"
  ))
## # A tibble: 6 x 2
##   rincome              n
##   <fct>            <int>
## 1 $25000 or more    7363
## 2 Not applicable    7043
## 3 Less than $10000  3578
## 4 $20000 - 24999    1283
## 5 $10000 - 14999    1168
## 6 $15000 - 19999    1048

16. Dates and times

16.2.4 Exercises

  1. What happens if you parse a string that contains invalid dates?

    ymd(c("2010-10-10", "bananas"))

The invalid elements are parsed as NA (with a warning that one element failed to parse), while the valid dates parse normally.

  1. What does the tzone argument to today() do? Why is it important?

tzone specifies the time zone used to determine the current date. It matters because, at any given instant, the date differs across time zones: it can still be one day in one zone and already the next day in another.
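A small illustration (time zone names taken from the tz database; depending on when this runs, the two calls can return different dates):

```r
today(tzone = "UTC")
today(tzone = "Pacific/Auckland")  # may already be "tomorrow" relative to UTC
```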

  1. Use the appropriate lubridate function to parse each of the following dates:

    d1 <- "January 1, 2010"
    d2 <- "2015-Mar-07"
    d3 <- "06-Jun-2017"
    d4 <- c("August 19 (2015)", "July 1 (2015)")
    d5 <- "12/30/14" # Dec 30, 2014
mdy(d1)
## [1] "2010-01-01"
ymd(d2)
## [1] "2015-03-07"
dmy(d3)
## [1] "2017-06-06"
mdy(d4)
## [1] "2015-08-19" "2015-07-01"
mdy(d5)
## [1] "2014-12-30"

16.3.4 Exercises

  1. How does the distribution of flight times within a day change over the course of the year?
flights_dt %>% 
  mutate(hour = hour(dep_time), month = factor(month(dep_time))) %>% 
  group_by(hour, month) %>% 
  summarise(
    n = n()
  ) %>% 
  ggplot(aes(x = hour, y = n, colour = month)) +
    geom_line()

The distribution of departure times within a day looks broadly similar from month to month.

  1. Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.

  2. Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)

  3. How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?

  4. On what day of the week should you leave if you want to minimise the chance of a delay?

flights_dt %>% 
  mutate(day = wday(sched_dep_time, label = TRUE)) %>% 
  group_by(day) %>% 
  summarise(dep_delay = mean(dep_delay)) %>%
  ggplot(aes(day, dep_delay)) +
    geom_col()

Saturday.

  1. What makes the distribution of diamonds$carat and flights$sched_dep_time similar?

  2. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

19. Functions

19.2.1 Practice

  1. Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?

Because TRUE is part of the hard-coded na.rm = TRUE inside the function, not an input that varies between calls; the only real input is x. na.rm specifies whether NA values should be removed prior to the calculation. If x contained a single missing value and na.rm was FALSE, range() would return NA and so would every element of the output.

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5))
## [1] 0 1

So far so good.

rescale01 <- function(x) {
  rng <- range(x, na.rm = FALSE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, NA))
## [1] NA NA NA

When na.rm is FALSE, NA values are contagious.
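Alternatively, na.rm could be exposed as an argument with a default, letting the caller decide (a sketch):

```r
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, NA))
## [1]  0  1 NA
```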

  1. In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.
x <- c(1:10, Inf)

rescale01_remapped <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  y <- (x - rng[1]) / (rng[2] - rng[1])
  y[y == -Inf] <- 0
  y[y == Inf] <- 1
  y
}

rescale01_remapped(x)
##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000 1.0000000
  1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?

    mean(is.na(x))
    
    x / sum(x, na.rm = TRUE)
    
    sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

mean() calculates the mean of its input. is.na() returns a logical vector of the same length as its input, indicating which elements are missing.

x <- c(1, 5, 10, 25, 25)
mean(x)
## [1] 13.2

Therefore, mean(is.na(x)) calculates the mean of a logical vector, where TRUE is equal to 1, and FALSE is equal to 0.

So, this function calculates the proportion of missing values in an object. It can be turned into a function like this:

prop_NA <- function(x){
  mean(is.na(x))
}

x <- c(1, 10, NA, NA)

prop_NA(x)
## [1] 0.5
x <- c(1, 5, 10, 25, 25)
x / sum(x, na.rm = TRUE)
## [1] 0.01515152 0.07575758 0.15151515 0.37878788 0.37878788

The code above turns each value of a numeric vector into its proportion of the vector’s total. sum(x, na.rm = TRUE) ignores missing values when computing the total, although any NA elements of x still produce NA in the output.

prop <- function(x){
  x / sum(x, na.rm = TRUE)
}

prop(x)
## [1] 0.01515152 0.07575758 0.15151515 0.37878788 0.37878788
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
## [1] 0.8510513

Finally, the last snippet of code divides the standard deviation of a numeric vector by the mean. According to Wikipedia, this is known as the coefficient of variation:

… the coefficient of variation, also known as relative standard deviation, is a standardized measure of dispersion of a probability distribution or frequency distribution.

It can be turned into a function like this:

coeff_var <- function(x){
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

coeff_var(x)
## [1] 0.8510513
  1. Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
# Variance
x <- sample(1:100, 100, replace = TRUE)

sample_var <- function(x) {
  n <- length(x)
  m <- mean(x)
  (1 / (n - 1)) * sum((x - m) ^ 2)
}
sample_var(x)
## [1] 877.0461
# Computes the same result as the built-in function:
var(x)
## [1] 877.0461
# Skew
x <- sample(1:100, 100, replace = TRUE)

skew <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  # Note: sum((x - m) ^ 3), not sum(x - m) ^ 3 -- the latter is always ~0
  (1 / n) * sum((x - m) ^ 3) / s ^ 3
}
skew(x)
  1. Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
x <- c(1, 4, 2, NA, NA)
y <- c(3, 5, NA, 1, NA)

both_na <- function(x, y){
  sum(is.na(x) & is.na(y))
}
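Using the x and y defined above, which are both missing only at position 5:

```r
both_na(x, y)
## [1] 1
```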
  1. What do the following functions do? Why are they useful even though they are so short?

    is_directory <- function(x) file.info(x)$isdir
    is_readable <- function(x) file.access(x, 4) == 0
is_directory("~/Google Drive/R/r4ds-solutions/data")
## [1] TRUE
is_readable("index.html")
## index.html 
##       TRUE

is_directory() can determine whether a file path is a directory. is_readable() tests whether a particular file is readable or not. These functions are helpful because the function names make it more readily apparent what each function does without the user having to look through the code.

  1. Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
foo_foo_mischief <- function() {
  foo_foo %>%
    hop(through = forest) %>%
    scoop(up = field_mouse) %>%
    bop(on = head)
}

foo_foo_mischief()
good_fairy %>% 
  says("Little bunny Foo Foo
        I don't want to see you
        Scooping up the field mice
        And bopping them on the head.
        I'll give you three chances,
        And if you don't stop, I'll turn you into a GOON!")

19.3.1 Exercises

  1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

    f1 <- function(string, prefix) {
      substr(string, 1, nchar(prefix)) == prefix
    }
    f2 <- function(x) {
      if (length(x) <= 1) return(NULL)
      x[-length(x)]
    }
    f3 <- function(x, y) {
      rep(y, length.out = length(x))
    }

The first function checks whether a string has a given prefix.

f1("hello world", "hello")
## [1] TRUE

substr() extracts substrings from a character vector by starting from the first element (1), and by using the number of characters in the prefix as the last/stop value. Then, it evaluates whether the substring is equal to that of the prefix.

substr("hello world", 1, nchar("hello"))
## [1] "hello"
"hello" == "hello"
## [1] TRUE

A better name for this function might be:

check_prefix <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

The second function removes the last element of a vector (the vector need not be numeric), returning NULL if the vector has one element or fewer. For example:

remove_last <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

x <- sample(1:10, 10, replace = TRUE)
y <- c("hello", "world")

remove_last(x)
## [1]  2  5  3  9  8 10  6  5  5
remove_last(y)
## [1] "hello"

The last function replicates the values in y based on the length of x.

repeat_values <- function(x, y) {
  rep(y, length.out = length(x))
}

x <- c(1, 3, 5, 3)
y <- c("hello", "world")

repeat_values(x, y)
## [1] "hello" "world" "hello" "world"
  1. Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
df <- tibble(
  a = c("47%", "23%", "97%", "88%", "12%")
)

remove_percentage <- function(column) {
  parse_double(gsub("[%]", "", column))
}

df$a <- remove_percentage(df$a)
df
## # A tibble: 5 x 1
##       a
##   <dbl>
## 1   47.
## 2   23.
## 3   97.
## 4   88.
## 5   12.
  1. Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?

rnorm() produces random numbers that are normally distributed, by default with a mean of 0 and a standard deviation of 1. In comparison, MASS::mvrnorm() produces samples from a multivariate normal distribution. To make them more consistent, they should use the same argument names: for example, MASS::mvrnorm() uses mu, whereas rnorm() uses mean. Consistent names between the functions would make them easier to use.
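A quick side-by-side comparison of the argument names (assuming MASS is installed):

```r
rnorm(3, mean = 0, sd = 1)
MASS::mvrnorm(n = 3, mu = c(0, 0), Sigma = diag(2))
```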

  1. Make a case for why norm_r(), norm_d() etc would be better than rnorm(), dnorm(). Make a case for the opposite.

Using norm_r(), norm_d() instead of rnorm(), dnorm() would be better because it would group the functions related to the normal distribution. In comparison, using rnorm(), dnorm() could be better because it groups them by actions. For example, r* functions involve sampling from distributions.

19.4.4 Exercises

  1. What’s the difference between if and ifelse()? Carefully read the help and construct three examples that illustrate the key differences.

ifelse() returns a vector of the same length as test filled with elements from the yes or no argument, depending on whether the test is TRUE or FALSE.

y <- sample(1:10, 10, replace = TRUE)
x <- ifelse(y < 5, "Too low", "Too high")
x
##  [1] "Too low"  "Too high" "Too low"  "Too low"  "Too high" "Too low" 
##  [7] "Too low"  "Too low"  "Too high" "Too low"

In comparison, if returns a single value because its condition must be a length-one logical vector; given a longer vector, only the first element is used (older versions of R warn about this, and newer versions treat it as an error).

y <- sample(1:10, 10, replace = TRUE)
x <- if (y < 20) "Too low" else "Too high"
## Warning in if (y < 20) "Too low" else "Too high": the condition has length
## > 1 and only the first element will be used
x
## [1] "Too low"
y <- 10
x <- if (y < 20) "Too low" else "Too high"
x
## [1] "Too low"
  1. Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)
greeting <- function(time = lubridate::now()) {
  hour <- hour(time)
  if (hour >= 0 && hour < 12) {
      "good morning"
    } else if (hour >= 12 && hour < 18) {
      "good afternoon"
    } else {
      "good evening"
    }
}

greeting()
## [1] "good morning"
  1. Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.
fizzbuzz <- function(x) {
  if (x %% 3 == 0 && x %% 5 == 0) {
    "fizzbuzz"
  } else if (x %% 3 == 0) {
    "fizz"
  } else if (x %% 5 == 0) {
    "buzz"
  } else {
    x
  }
}

# Test function
fizzbuzz(15)
## [1] "fizzbuzz"
fizzbuzz(3)
## [1] "fizz"
fizzbuzz(5)
## [1] "buzz"
fizzbuzz(2)
## [1] 2
  1. How could you use cut() to simplify this set of nested if-else statements?

    if (temp <= 0) {
      "freezing"
    } else if (temp <= 10) {
      "cold"
    } else if (temp <= 20) {
      "cool"
    } else if (temp <= 30) {
      "warm"
    } else {
      "hot"
    }

    How would you change the call to cut() if I’d used < instead of <=? What is the other chief advantage of cut() for this problem? (Hint: what happens if you have many values in temp?)

cut() divides the values of a numeric vector into intervals and recodes each value according to the interval it falls into.

x <- 1:10
cut(x, breaks = c(1, 6, 10), include.lowest = TRUE, right = FALSE)
##  [1] [1,6)  [1,6)  [1,6)  [1,6)  [1,6)  [6,10] [6,10] [6,10] [6,10] [6,10]
## Levels: [1,6) [6,10]

To simplify the example, you can replace the nested if-else statements with cut():

temp <- seq(-5, 40, by = 5)

cut(temp, breaks = c(-Inf, 0, 10, 20, 30, Inf), right = TRUE, labels = c("freezing", "cold", "cool", "warm", "hot"))
##  [1] freezing freezing cold     cold     cool     cool     warm    
##  [8] warm     hot      hot     
## Levels: freezing cold cool warm hot

If < was used instead of <=, you would have to change the right argument in cut():

temp <- seq(-5, 40, by = 5)

cut(temp, breaks = c(-Inf, 0, 10, 20, 30, Inf), right = FALSE, labels = c("freezing", "cold", "cool", "warm", "hot"))
##  [1] freezing cold     cold     cool     cool     warm     warm    
##  [8] hot      hot      hot     
## Levels: freezing cold cool warm hot

The main advantage of using cut() for this problem is that it takes a vector of multiple values, not just one (as is the case with an if-else statement).

  1. What happens if you use switch() with numeric values?

The first argument of switch (EXPR) determines which element of ... is returned.

switch(EXPR = 2, 5, 6, 7, 8)
## [1] 6

In this example, EXPR = 2, so the second element from ... (6) is returned.
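If the numeric value exceeds the number of alternatives, switch() returns NULL invisibly:

```r
is.null(switch(EXPR = 5, 5, 6, 7, 8))
## [1] TRUE
```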

  1. What does this switch() call do? What happens if x is “e”?

    switch(x, 
      a = ,
      b = "ab",
      c = ,
      d = "cd"
    )

    Experiment, then carefully read the documentation.

It matches the EXPR (x) exactly against the names of the arguments in .... An argument with a missing value (such as a = ,) “falls through” to the next non-empty value, so both “a” and “b” return “ab”, and both “c” and “d” return “cd”.

x <- "a"

switch(x, 
       a = ,
       b = "ab",
       c = ,
       d = "cd"
)
## [1] "ab"
x <- "e"

switch(x, 
       a = ,
       b = "ab",
       c = ,
       d = "cd"
)

If x is “e”, no name matches and there is no default value, so switch() returns NULL invisibly (nothing is printed).

19.5.5 Exercises

  1. What does commas(letters, collapse = "-") do? Why?

It throws an error. commas() already passes collapse = ", " to stringr::str_c() internally, so the collapse = "-" supplied through ... clashes with it: formal argument "collapse" matched by multiple actual arguments.
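Assuming commas() is defined as in the chapter, the clash can be demonstrated directly:

```r
commas <- function(...) stringr::str_c(..., collapse = ", ")

commas(letters[1:3])
## [1] "a, b, c"

commas(letters, collapse = "-")
## Error: formal argument "collapse" matched by multiple actual arguments
```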

  1. It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}

rule("Title", pad = "-+")
## Title -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

It runs without error, but the rule comes out roughly twice as wide as intended: stringr::str_dup(pad, width) repeats the two-character pad width times, producing 2 * width characters instead of width. A fix is to account for nchar(pad) when computing the amount of duplication.
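A sketch of a version that divides by nchar(pad) and then trims, so multi-character pads produce a rule of the intended width (rule2 is a hypothetical name):

```r
rule2 <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  # Repeat the pad enough times, then trim to exactly `width` characters
  padding <- stringr::str_dup(pad, ceiling(width / nchar(pad)))
  cat(title, " ", stringr::str_sub(padding, 1, width), "\n", sep = "")
}
```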

  1. What does the trim argument to mean() do? When might you use it?

trim removes a fraction of observations (between 0 and 0.5) from each end of x before computing the mean. This is useful when x contains outliers that would otherwise distort the mean.

x <- sample(1:10, 10, replace = TRUE)
mean(x)
## [1] 5.3
mean(x, trim = 0.1)
## [1] 5.25
  1. The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?

These are the methods cor() will accept for computing a correlation (it will not accept any other strings). Internally, match.arg() selects the first element, so pearson is the default.
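This vector default is the standard match.arg() idiom; a minimal sketch (my_method is a hypothetical function):

```r
my_method <- function(method = c("pearson", "kendall", "spearman")) {
  # With no argument supplied, match.arg() returns the first element;
  # with an argument, it validates (and partially matches) against the vector
  match.arg(method)
}

my_method()            # "pearson"
my_method("spearman")  # "spearman"
my_method("caveman")   # error: 'arg' should be one of ...
```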

random_sample <- function() {
  sample(1:100, 50, replace = TRUE)
}

x <- random_sample()
y <- random_sample()

cor(x, y, method = "spearman")
## [1] 0.04182181
cor(x, y, method = "caveman") # throws an error
## Error in match.arg(method): 'arg' should be one of "pearson", "kendall", "spearman"

21. Iteration

21.2.1 Exercises

  1. Write for loops to:

    1. Compute the mean of every column in mtcars.
    2. Determine the type of each column in nycflights13::flights.
    3. Compute the number of unique values in each column of iris.
    4. Generate 10 random normals for each of μ = -10, 0, 10, and 100.

    Think about the output, sequence, and body before you start writing the loop.

Compute the mean of every column in mtcars:

# Name the output something other than `mean` to avoid masking base::mean()
col_means <- vector("double", ncol(mtcars))
names(col_means) <- names(mtcars)
for (i in seq_along(mtcars)) {
  col_means[[i]] <- mean(mtcars[[i]])
}
col_means
##        mpg        cyl       disp         hp       drat         wt 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250 
##       qsec         vs         am       gear       carb 
##  17.848750   0.437500   0.406250   3.687500   2.812500

Determine the type of each column in nycflights13::flights:

library(nycflights13)

col_type <- vector("list", ncol(flights))
names(col_type) <- names(flights)
for (i in seq_along(flights)) {
  col_type[[i]] <- class(flights[[i]])
}
col_type
## $year
## [1] "integer"
## 
## $month
## [1] "integer"
## 
## $day
## [1] "integer"
## 
## $dep_time
## [1] "integer"
## 
## $sched_dep_time
## [1] "integer"
## 
## $dep_delay
## [1] "numeric"
## 
## $arr_time
## [1] "integer"
## 
## $sched_arr_time
## [1] "integer"
## 
## $arr_delay
## [1] "numeric"
## 
## $carrier
## [1] "character"
## 
## $flight
## [1] "integer"
## 
## $tailnum
## [1] "character"
## 
## $origin
## [1] "character"
## 
## $dest
## [1] "character"
## 
## $air_time
## [1] "numeric"
## 
## $distance
## [1] "numeric"
## 
## $hour
## [1] "numeric"
## 
## $minute
## [1] "numeric"
## 
## $time_hour
## [1] "POSIXct" "POSIXt"

Compute the number of unique values in each column of iris:

library(dplyr)    # n_distinct()
library(forcats)  # fct_unique()

unique_values <- vector("integer", ncol(iris))
names(unique_values) <- names(iris)
for (i in seq_along(iris)) {
  unique_values[[i]] <- n_distinct(iris[[i]])
}
unique_values
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##           35           23           43           22            3
# Check the above results
fct_unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica

Generate 10 random normals for each of μ = -10, 0, 10, and 100:

mu <- c(-10, 0, 10, 100)

random_normals <- vector("list", length(mu))
for (i in seq_along(random_normals)) {
  random_normals[[i]] <- rnorm(10, mean = mu[[i]])
}
random_normals
## [[1]]
##  [1]  -9.668042 -12.752640 -11.666053 -11.786233 -11.171829 -10.472993
##  [7] -10.390668  -9.098248  -9.370741 -10.759146
## 
## [[2]]
##  [1] -1.57981451 -0.69247715 -0.32280781 -0.41258331 -0.42207960
##  [6]  1.32236125  0.03267892  1.81945278 -0.79630745 -1.90735874
## 
## [[3]]
##  [1] 10.305425  9.566745 10.780071 10.996345 10.745244 11.101034  9.884776
##  [8] 10.639426  9.819239 10.231762
## 
## [[4]]
##  [1] 100.85902  99.03889  99.76429 100.08287 100.84113  98.01641  98.80659
##  [8]  98.78463 100.69881  99.15648
  1. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

    out <- ""
    for (x in letters) {
      out <- stringr::str_c(out, x)
    }
    
    x <- sample(100)
    sd <- 0
    for (i in seq_along(x)) {
      sd <- sd + (x[i] - mean(x)) ^ 2
    }
    sd <- sqrt(sd / (length(x) - 1))
    
    x <- runif(100)
    out <- vector("numeric", length(x))
    out[1] <- x[1]
    for (i in 2:length(x)) {
      out[i] <- out[i - 1] + x[i]
    }
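Each loop has a vectorised equivalent (a sketch):

```r
library(stringr)

# 1. Concatenate all letters into a single string
out <- str_c(letters, collapse = "")

# 2. Sample standard deviation (sd() uses the n - 1 denominator,
#    matching the loop)
x <- sample(100)
sd_x <- sd(x)

# 3. Cumulative sum
x <- runif(100)
out <- cumsum(x)
```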
  2. Combine your function writing and for loop skills:

    1. Write a for loop that prints() the lyrics to the children’s song “Alice the camel”.

    2. Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.

    3. Convert the song “99 bottles of beer on the wall” to a function. Generalise to any number of any vessel containing any liquid on any surface.

  3. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:

    output <- vector("integer", 0)
    for (i in seq_along(x)) {
      output <- c(output, lengths(x[[i]]))
    }
    output

    How does this affect performance? Design and execute an experiment.

Growing the output with c() copies the whole vector on every iteration, making the loop quadratic in the length of the input; a preallocated loop is linear, so the performance gap widens as inputs increase in size.
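A sketch of such an experiment (grow and prealloc are hypothetical names); timings will vary by machine, but grow() should slow down dramatically as n increases:

```r
grow <- function(n) {
  out <- integer(0)
  for (i in 1:n) out <- c(out, i)  # copies the whole vector each time
  out
}

prealloc <- function(n) {
  out <- integer(n)                # allocate once up front
  for (i in 1:n) out[i] <- i
  out
}

n <- 1e4
system.time(grow(n))
system.time(prealloc(n))
```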

27. R Markdown

27.2.1 Exercises

  1. Create a new notebook using File > New File > R Notebook. Read the instructions. Practice running the chunks. Verify that you can modify the code, re-run it, and see modified output.

  2. Create a new R Markdown document with File > New File > R Markdown… Knit it by clicking the appropriate button. Knit it by using the appropriate keyboard shortcut. Verify that you can modify the input and see the output update.

  3. Compare and contrast the R notebook and R markdown files you created above. How are the outputs similar? How are they different? How are the inputs similar? How are they different? What happens if you copy the YAML header from one to the other?

  • R notebook files are suited to capturing what you did and what you were thinking, and to communicating with other data scientists
  • R markdown files are suited to communicating with decision-makers who don’t care about the code
  • An R notebook file lets the reader show/hide code and download the source file; an R markdown file does not
  • By default, the R markdown file contains additional information (author and date) in its YAML header
  • Copying the YAML header from one to the other changes the file type, because output: determines the type of file that is created
  1. Create one new R Markdown document for each of the three built-in formats: HTML, PDF and Word. Knit each of the three documents. How does the output differ? How does the input differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)

The outputs differ mainly in appearance: for example, the HTML output uses a sans-serif font while the PDF output uses a serif font. The inputs differ only in the output: field of the YAML header.

27.3.1 Exercises

  1. Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.

See this repo.

  1. Using the R Markdown quick reference, figure out how to:

    1. Add a footnote.
    2. Add a horizontal rule.
    3. Add a block quote.

Add a footnote:

A footnote[^1]

[^1]: Here is the footnote

A footnote1

(This footnote will be displayed at the bottom of this document.)

Add a horizontal rule:

***

Add a block quote:

> block quote

block quote

  1. Copy and paste the contents of diamond-sizes.Rmd from https://github.com/hadley/r4ds/tree/master/rmarkdown in to a local R markdown document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.

28. Graphics for communication

28.2.1 Exercises

  1. Create one plot on the fuel economy data with customised title, subtitle, caption, x, y, and colour labels.
ggplot(mpg, aes(x = hwy, y = cty)) +
  geom_point(aes(colour = class)) +
  labs(
    title = "Small cars are more fuel efficient",
    subtitle = "SUVs have poor fuel economy",
    caption = "Data from fueleconomy.gov",
    x = "Highway miles per gallon",
    y = "City miles per gallon",
    colour = "Class"
  )

  1. The geom_smooth() is somewhat misleading because the hwy for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines. Use your modelling tools to fit and display a better model.

  2. Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(colour = Species)) +
  labs(
    title = "Iris virginica and versicolor have large petals",
    subtitle = "Iris setosa have substantially smaller petals",
    caption = "Source: Edgar Anderson's Iris Data",
    x = "Petal length (cm)",
    y = "Petal width (cm)",
    colour = "Species of iris"
  )

28.3.1 Exercises

  1. Use geom_text() with infinite positions to place text at the four corners of the plot.
labels <- tibble(
  cty = c(Inf, Inf, -Inf, -Inf),
  displ = c(Inf, -Inf, Inf, -Inf),
  labels = c("Text", "tExt", "teXt", "texT"),
  vjust_var = c("top", "top", "bottom", "bottom"),
  hjust_var = c("right", "left", "right", "left")
)

ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  geom_text(data = labels, aes(label = labels, vjust = vjust_var, hjust = hjust_var))

  1. Read the documentation for annotate(). How can you use it to add a text label to a plot without having to create a tibble?
ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  annotate("text", x = 7, y = 35, label = "Text")

You can also easily add multiple labels:

ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  annotate("text", x = c(7, 1), y = c(35, 35), label = c("Text", "tExt"))

  1. How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: think about the underlying data.)

  2. What arguments to geom_label() control the appearance of the background box?

The fill aesthetic controls the colour of the background box; label.padding, label.r, and label.size control its padding, corner radius, and border thickness. For example:

best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  geom_label(aes(label = model, fill = class), data = best_in_class)

  1. What are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options.

The four arguments to arrow() are angle (the angle of the arrowhead in degrees), length (the length of the arrowhead), ends ("last", "first", or "both": which end(s) of the line get an arrowhead), and type ("open" or "closed": whether the arrowhead is a filled triangle). The resulting arrow is passed to the arrow argument of geom_segment().
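A sketch demonstrating the options on a single segment (the positions are arbitrary):

```r
library(ggplot2)
library(grid)  # arrow() and unit() (also re-exported by ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_segment(
    x = 2, y = 42, xend = 4.5, yend = 30,
    arrow = arrow(
      angle = 20,                    # width of the arrowhead
      length = unit(0.15, "inches"), # size of the arrowhead
      ends = "both",                 # arrowheads on both ends
      type = "closed"                # filled triangle
    )
  )
```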


  1. Here is the footnote